Preface


Nota  Bene:  This  manuscript  has  not  yet  reached  its  final  form.  In  particular,  I  have  not  had  the 
opportunity  to  check  all  the  details  carefully  and  to  polish  the  writing  so  that  it  reflects  the  main 
points  as  brightly  as  possible.  At  this  stage,  the  citations  and  background  references  still  lack  the 
precision  that  I  try  to  bring  to  my  published  works.  I  welcome  any  comments  or  corrections  that 
may  help  improve  subsequent  versions  of  these  notes. 

These  lecture  notes  are  designed  to  bring  random  matrix  theory  to  the  people.  In  recent 
years,  random  matrices  have  come  to  play  a  major  role  in  computational  mathematics,  but  most 
of  the  classical  methods  for  studying  random  matrices  remain  the  province  of  experts.  Over 
the  last  decade,  with  the  advent  of  matrix  concentration  inequalities,  research  has  advanced  to 
the  point  where  we  can  conquer  many  (formerly)  challenging  problems  with  a  page  or  two  of 
arithmetic.  My  aim  is  to  describe  the  most  successful  methods  from  this  area  along  with  some 
interesting  examples  that  these  techniques  can  illuminate.  I  hope  that  the  results  in  these  pages 
will  inspire  future  work  on  applications  of  random  matrices  as  well  as  refinements  of  the  matrix 
concentration  inequalities  discussed  herein. 

As with any extended work, my own interests and experience necessarily govern the content.
In other words, I unapologetically emphasize the projects that I have engaged in over the
last five years. This slant is not intended to diminish other contributions to the study of matrix
concentration inequalities and their applications. Indeed, I have been influenced strongly by
the work of many researchers, including Rudolf Ahlswede, Rajendra Bhatia, Jean Bourgain, Eric
Carlen, Sourav Chatterjee, Edward Effros, Elliott Lieb, Lester Mackey, Roberto Oliveira, Dénes
Petz, Gilles Pisier, Mark Rudelson, Roman Vershynin, and Andreas Winter. I have also learned a
great deal from other colleagues and friends along the way.

I  gratefully  acknowledge  financial  support  from  the  Office  of  Naval  Research  under  awards 
N00014-08-1-0883 and N00014-11-1002, the Air Force Office of Scientific Research under award
FA9550-09-1-0643, and an Alfred P. Sloan Fellowship. Some of this research was completed at
the Institute for Pure and Applied Mathematics at UCLA. I would also like to thank the California
Institute  of  Technology  and  the  Moore  Foundation. 
CHAPTER 1


Introduction 


Random matrix theory has become a large and vital field of probability, and it has found
applications in a wide variety of other areas. To motivate the results in these notes, we begin with an
overview of the connections between random matrix theory and computational mathematics.
We introduce the basic ideas underlying our approach, and we state one of our main results on
the behavior of random matrices. As an application, we examine the properties of the sample
covariance estimator, a random matrix that arises in classical statistics. Afterward, we summarize
the other types of results that appear in these notes, and we assess the novelties in this
presentation.


1.1  Historical  Origins 

Random  matrix  theory  sprang  from  several  different  sources  in  the  first  half  of  the  20th  century. 

Multivariate Statistics. One of the earliest examples of a random matrix appeared in the work
of John Wishart [Wis28]. Wishart was studying the behavior of the sample covariance
estimator for the covariance matrix of a multivariate normal random vector. He showed that
the estimator, which is a random matrix, has the distribution that now bears his name.
Statisticians have often used random matrices as models for multivariate data [Mui82].

Numerical Linear Algebra. In their remarkable work [vNG47, GvN51] on computational
methods for solving systems of linear equations, von Neumann and Goldstine considered a
random matrix model for the floating-point errors that arise from LU decomposition.¹ They
obtained a high-probability bound for the norm of the random matrix, which they took
as an estimate for the amount of error the procedure might typically incur. Curiously,
in subsequent years, numerical linear algebraists became very suspicious of probabilistic
techniques, and only in recent years have randomized algorithms reappeared in this
field [HMT11].

¹ It is breathtaking that von Neumann and Goldstine invented and analyzed this algorithm before they had any digital
computer on which to implement it! See [Grc11] for a historical account.
Nuclear Physics. In the early 1950s, physicists had reached the limits of deterministic
analytical techniques for modeling the energy spectra of heavy atoms undergoing slow nuclear
reactions. Eugene Wigner was the first researcher to surmise that a random matrix, with
appropriate symmetries, might serve as a suitable model for the Hamiltonian of the
quantum mechanical system that describes the reaction. The eigenvalues of this random
matrix, then, would model the possible energy levels of the system. See Mehta’s book for an
account of all this [Meh04].

In  each  area,  the  motivation  was  quite  different  and  led  to  distinct  sets  of  questions.  Later, 
random matrices began to percolate into other fields, such as graph theory (the Erdős–Rényi
model  [ER60]  for  a  random  graph)  and  number  theory  (as  a  model  for  the  spacing  of  zeros  of  the 
Riemann  zeta  function  [Mon73]). 

1.2  The  Modern  Random  Matrix 

By  now,  random  matrices  are  ubiquitous.  They  arise  throughout  modern  mathematics  and 
statistics, as well as in many branches of science and engineering. Random matrices have
several different purposes that we may wish to distinguish. They can be used within randomized
computer  algorithms;  they  serve  as  models  for  data  and  for  physical  phenomena;  and  they  are 
subjects  of  mathematical  inquiry. 

1.2.1  Algorithmic  Applications 

The striking mathematical properties of random matrices can be harnessed to develop
algorithms for solving many different problems.

Computing  Matrix  Approximations.  Random  matrices  provide  an  efficient  way  to  construct 
approximations of large matrices. For example, they can be used to develop fast
algorithms for computing a truncated singular-value decomposition. In this application, we
multiply  a  large  input  matrix  by  a  smaller  random  matrix  to  extract  information  about  the 
dominant  singular  vectors  of  the  input  matrix.  See  the  paper  [HMT11]  for  an  overview  of 
these  ideas.  This  approach  has  been  very  successful  in  practice. 

Subsampling of Data. One method that has been used in large-scale machine learning is to
subsample data randomly before fitting a model. For instance, we can combine random
sampling with the Nyström decomposition to approximate a kernel matrix efficiently [Git11].
The  success  of  this  approach  depends  on  the  properties  of  a  small  random  submatrix 
drawn  from  a  large,  fixed  matrix. 

Dimension Reduction. In theoretical computer science, a common algorithmic template
involves using randomness to reduce the dimension of the problem. The paper [AC09]
describes an approach to nearest neighbor computations, based on random projection of the
input  data,  that  has  become  very  popular.  Random  matrix  theory  forms  a  core  part  of  the 
analysis. 

Sparsification.  One  way  to  accelerate  spectral  computations  on  large  matrices  is  to  replace  the 
original  matrix  by  a  sparse  proxy  that  has  similar  spectral  properties.  An  elegant  way  to 
produce  the  sparse  proxy  is  to  zero  out  entries  of  the  original  matrix  at  random  while 
rescaling the entries that remain [AM07]. This idea plays an important role in Spielman
and Teng’s work on fast algorithms for solving linear systems [ST04].

Combinatorial  Optimization.  One  way  to  solve  a  hard  combinatorial  optimization  problem  is 
to  replace  the  intractable  computation  with  a  related  optimization  problem  that  may  be 
more  tractable  [BTN01].  After  solving  the  easier  problem,  we  can  perform  a  randomized 
operation to obtain an approximate solution to the original hard problem. For
optimization problems involving matrices, random matrix theory is central to the analysis [So09].

Compressed Sensing. Random matrices appear as measurement operators in the field of
compressed sensing [Don06]. When acquiring data about an object with relatively few degrees
of freedom as compared with the ambient dimension, we can sieve out the important
information from the object by taking a small number of random measurements, where the
number of measurements is comparable to the number of degrees of freedom. This
application is possible because of geometric properties of random matrices [CRPW12].
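To make the first item above concrete, here is a minimal sketch of the idea of multiplying a large input matrix by a smaller random matrix to capture its dominant singular vectors. The function name, the Gaussian test matrix, and the oversampling parameter are illustrative assumptions in the spirit of [HMT11]; this is not a transcription of any specific algorithm from that paper.

```python
import numpy as np

def randomized_truncated_svd(A, rank, oversample=10, seed=0):
    """Approximate the top `rank` singular triplets of A by random projection."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    ell = min(n, rank + oversample)
    # Multiply the input by a small Gaussian test matrix to sample its range.
    Omega = rng.standard_normal((n, ell))
    Q, _ = np.linalg.qr(A @ Omega)  # orthonormal basis for the sampled range
    # Compress A onto that basis and take an exact SVD of the small matrix.
    U_small, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    U = Q @ U_small
    return U[:, :rank], s[:rank], Vt[:rank, :]

# Usage on a hypothetical 200 x 100 input of exact rank 5.
rng = np.random.default_rng(1)
A = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 100))
U, s, Vt = randomized_truncated_svd(A, rank=5)
rel_err = np.linalg.norm(A - (U * s) @ Vt, 2) / np.linalg.norm(A, 2)
print("relative spectral-norm error:", rel_err)
```

For an input with rapidly decaying singular values, the small matrix `Q.T @ A` carries most of the spectral information, which is why the expensive factorization is applied only to it.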


1.2.2  Modeling 

Random matrices also appear as models for multivariate data or multivariate phenomena. By
studying the properties of these models, we may hope to obtain an understanding of the
average-case behavior of a data-analysis algorithm or a physical system.


Sparse  Approximation  for  Random  Signals.  Sparse  approximation  has  become  an  important 
problem  in  statistics,  signal  processing,  machine  learning  and  other  areas.  One  model  for 
a “typical” sparse signal involves the assumption that the nonzero coefficients that
generate the signal are chosen at random. When analyzing methods for identifying the sparse
set of coefficients, we must study the behavior of a random column submatrix drawn from
the model matrix [Tro08a, Tro08b].

Demixing  of  Structured  Signals.  In  data  analysis,  it  is  common  to  encounter  a  superposition  of 
two  structured  signals,  and  the  goal  is  to  extract  the  two  signals  using  prior  information 
about the structures. A common model for this problem assumes that the signals are
randomly oriented with respect to each other, which means that it is usually possible to
discriminate the underlying structures. Random matrices arise in the analysis of estimation
techniques  for  this  problem  [MT12]. 

High-Dimensional Data Analysis. More generally, random models are pervasive in the
analysis of statistical estimation procedures for high-dimensional data. Random matrix theory
plays a key role in this field [Kol11, BvdG11].

Wireless Communication. Random matrices are commonly used as models for wireless
channels. See the book of Tulino and Verdú for more information [TV04].


In  these  examples,  it  is  important  to  recognize  that  random  models  may  not  coincide  very  well 
with  reality,  but  they  allow  us  to  get  a  sense  of  what  might  be  possible  in  some  generic  cases. 
1.2.3 Theoretical Aspects

Random  matrices  are  frequently  studied  for  their  intrinsic  mathematical  interest.  In  some  fields, 
they  provide  examples  of  striking  phenomena.  In  other  areas,  they  furnish  counterexamples  to 
“intuitive” conjectures. Here are a few disparate problems where random matrices play a role.

Combinatorics.  An  expander  graph  has  the  property  that  every  small  set  of  vertices  has  edges 
linking  it  to  a  large  proportion  of  the  vertices.  The  expansion  property  is  closely  related  to 
the  spectral  behavior  of  the  adjacency  matrix  of  the  graph.  The  easiest  construction  of  an 
expander involves a random matrix [AS00, §9.2].

Algorithms. For worst-case examples, the Gaussian elimination method for solving a linear
system is not numerically stable. In practice, however, this is a non-issue. One explanation for
this  phenomenon  is  that,  with  high  probability,  a  small  random  perturbation  of  any  fixed 
matrix  is  well  conditioned.  As  a  consequence,  it  can  be  shown  that  Gaussian  elimination 
is  stable  for  most  matrices  [SST06] . 

High-Dimensional Geometry. Dvoretzky's Theorem states that, when $N$ is large, the unit ball
of every $N$-dimensional Banach space has a slice of dimension $n \approx \log N$ that is close to a
Euclidean ball with dimension $n$. It turns out that a random slice of dimension $n$ realizes
this property. This important result can be framed as a statement about spectral properties
of a random matrix [Gor85].

Quantum  Information  Theory.  Random  matrices  appear  as  examples  and  counterexamples  for 
a number of conjectures in quantum information theory. We refer the reader to the
papers [HW08, Has09] for details.

1.3  Random  Matrices  for  the  People 

Historically,  random  matrix  theory  has  been  regarded  as  a  very  challenging  field.  Even  now, 
many  well-established  methods  are  only  accessible  to  researchers  with  significant  experience, 
and  it  takes  months  of  intensive  effort  to  prove  new  results.  There  are  a  small  number  of  classes 
of  random  matrices  that  have  been  studied  so  completely  that  we  know  almost  everything  about 
them.  Yet,  moving  beyond  this  terra  firma,  one  quickly  encounters  examples  where  classical 
methods  are  brittle. 

We  intend  to  democratize  random  matrix  theory.  These  notes  describe  tools  that  deliver 
useful  information  about  a  wide  range  of  random  matrices.  In  many  cases,  a  modest  amount 
of  straightforward  arithmetic  leads  to  strong  results.  The  methods  here  should  be  accessible  to 
computational  scientists  working  in  a  variety  of  fields.  Indeed,  the  techniques  in  this  work  have 
already  found  an  extensive  number  of  applications.  Almost  every  week,  we  learn  about  a  paper 
that  uses  these  ideas  for  a  novel  purpose. 

1.4  Basic  Questions  in  Random  Matrix  Theory 

Although it sounds prosaic, random matrices merit attention precisely because they are
matrices. As a consequence, random matrices have spectral properties: eigenvalues and eigenvectors
in the case of square matrices, singular values and singular vectors in the case of general
matrices. The most basic problems all concern these spectral properties. Here are some questions
that we might ask:

•  What  is  the  expectation  of  the  maximum  eigenvalue  of  a  random  symmetric  matrix?  What 
about  the  minimum  eigenvalue? 

•  How  are  the  extreme  eigenvalues  of  a  random  symmetric  matrix  distributed?  What  is  the 
probability  that  they  take  values  substantially  different  from  the  mean? 

•  What  is  the  expected  spectral  norm  of  a  random  matrix?  What  is  the  probability  that  the 
norm  takes  a  value  substantially  different  from  the  mean? 

•  What  about  the  other  eigenvalues  or  singular  values?  Can  we  say  something  about  the 
“typical”  spectrum  of  a  random  matrix? 

•  Can  we  say  anything  about  the  eigenvectors  or  singular  vectors?  For  instance,  is  each  one 
distributed  uniformly  on  the  sphere? 

•  We can also ask questions about the operator norm of a random matrix acting as a map
between two normed linear spaces. In this case, the geometry of the domain and codomain
plays an important role.

In  this  work,  we  focus  on  the  first  three  questions  above.  We  study  the  expectation  of  the  extreme 
eigenvalues of random symmetric matrices, and we attempt to provide bounds on the
probability that they take an unusual value. As an application of these results, we show how to control
the  expected  spectral  norm  of  a  general  matrix  and  to  bound  the  probability  of  a  large  deviation. 
These  are  the  most  important  issues  for  most  (but  not  all!)  applications.  We  will  not  touch  on 
the  remaining  questions. 

1.5  Random  Matrices  as  Independent  Sums 

Our  approach  to  random  matrices  depends  on  a  fundamental  principle: 

In  applications,  it  is  common  that  a  random  matrix  can  be  expressed  as  a  sum  of 
independent  random  matrices. 

The  applications  that  appear  in  these  notes  should  provide  ample  evidence  for  this  claim.  For 
now, let us describe a specific problem that will serve as a running example throughout the
Introduction. We hope this example is complicated enough to be interesting but simple enough to
illustrate  the  main  points  clearly. 

1.5.1  Example:  A  Sample  Covariance  Matrix

Let $x = (X_1, \dots, X_p)$ be a random vector with zero mean: $\mathbb{E}\, x = 0$. Assume that the Euclidean norm
of the random vector is bounded: $\|x\|^2 \le B$. The covariance of the random vector $x$ is the
positive-semidefinite matrix

$$A = \mathbb{E}(x x^*) = \sum_{j,k=1}^{p} \mathbb{E}(X_j X_k^*) \, \mathbf{E}_{jk}. \tag{1.5.1}$$

Here $\mathbf{E}_{jk}$ denotes the $p \times p$ matrix with a one in position $(j, k)$ and zeros elsewhere. In other
words, the $(j, k)$ entry of the covariance matrix records the covariance between the $j$th and $k$th
entries of the vector.

One basic problem in statistical practice is to estimate the covariance matrix from data.
Imagine that we have access to $n$ independent samples $x_1, \dots, x_n$, distributed the same way as
$x$. The sample covariance estimator is defined as the random matrix

$$Y = \frac{1}{n} \sum_{k=1}^{n} x_k x_k^*. \tag{1.5.2}$$

The random matrix $Y$ is an unbiased estimator for the covariance matrix: $\mathbb{E}\, Y = A$. The
formula (1.5.2) supposes that the random vector $x$ is known to have zero mean; in general, we
would have to make some adjustments to incorporate an estimate for the mean. To
emphasize,

The sample covariance estimator $Y$ can be expressed as a sum of independent
random matrices.

This  is  precisely  the  type  of  decomposition  that  our  tools  require. 
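As a small numerical illustration of this decomposition, the sketch below builds the sample covariance estimator term by term as a sum of independent rank-one matrices and checks it against the usual one-line matrix product. The sampling model (entries uniform on [-1, 1]) and the sizes are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 4, 1000

# Assumed sampling model: independent entries uniform on [-1, 1], so that
# E x = 0, ||x||^2 <= p =: B, and the covariance is A = (1/3) * I.
X = rng.uniform(-1.0, 1.0, size=(n, p))  # row k holds the sample x_k
A = np.eye(p) / 3.0

# Each summand (1/n) x_k x_k^* is an independent rank-one random matrix.
summands = [np.outer(x, x) / n for x in X]
Y = sum(summands)

# The same estimator written as a single matrix product.
Y_direct = X.T @ X / n

print("max entrywise gap between the two forms:", np.abs(Y - Y_direct).max())
print("spectral-norm error ||Y - A||:", np.linalg.norm(Y - A, 2))
```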

1.6  Exponential  Concentration  Inequalities  for  Matrices 

An important challenge in classical probability is to study the probability that a random variable
$Z$ takes a value substantially different from its mean. That is, we seek a bound of the form

$$\mathbb{P}\{ |Z - \mathbb{E}\, Z| \ge t \} \le \; ??? \tag{1.6.1}$$

for a positive parameter $t$. When $Z$ is expressed as a sum of independent random variables, the
literature contains many tools for addressing this problem.

For a random matrix $Z$, a variant of (1.6.1) is the question of whether $Z$ deviates substantially
from its mean value. We might frame this question as

$$\mathbb{P}\{ \|Z - \mathbb{E}\, Z\| \ge t \} \le \; ??? \tag{1.6.2}$$

Here and elsewhere, $\|\cdot\|$ denotes the spectral norm of a matrix. As noted, it is frequently possible
to decompose $Z$ as a sum of independent random matrices. We might even dream that the
classical methods for studying the scalar concentration problem (1.6.1) extend to (1.6.2).

1.6.1  The  Bernstein  Inequality 

To  explain  what  kind  of  results  we  have  in  mind,  we  return  to  the  scalar  problem  (1.6.1).  Suppose 
that  we  can  express  the  real  random  variable  Z  as  a  sum  of  independent  real  random  variables. 
To  control  Z,  we  rely  on  two  types  of  information:  global  properties  of  the  sum  (such  as  its  mean 
and  variance)  and  local  properties  of  the  summands  (such  as  their  maximum  fluctuation) .  These 
pieces  of  data  are  usually  easy  to  obtain.  Together,  they  guarantee  that  Z  concentrates  sharply 
around  its  mean  value. 

Theorem 1.6.1 (Bernstein Inequality). Let $S_1, \dots, S_n$ be independent random variables that have
bounded deviation from their mean values:

$$|S_k - \mathbb{E}\, S_k| \le R \quad \text{for each } k = 1, \dots, n.$$

Form the sum $Z = \sum_{k=1}^{n} S_k$, and introduce a variance parameter $\sigma^2 = \mathbb{E}[(Z - \mathbb{E}\, Z)^2]$. Then

$$\mathbb{P}\{ |Z - \mathbb{E}\, Z| \ge t \} \le 2 \exp\left( \frac{-t^2/2}{\sigma^2 + Rt/3} \right) \quad \text{for all } t \ge 0.$$

See  the  survey  paper  [Lug09]  for  a  proof  of  this  result. 

We refer to Theorem 1.6.1 as an exponential concentration inequality because it yields
exponentially decaying bounds on the probability that $Z$ deviates substantially from its mean. More
precisely, the result implies that the probability that the sum $Z$ exhibits a moderate deviation
($t \le \sigma^2 / R$) decays like the tail of a normal random variable with variance $\sigma^2$. The probability that
the sum $Z$ exhibits a large deviation ($t \ge \sigma^2 / R$) decays like an exponential random variable with
mean $R$.
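To see the scalar Bernstein bound at work, the following sketch evaluates the right-hand side of Theorem 1.6.1 and compares it with an empirical tail estimate. The choice of summands (uniform on [-1, 1]), the number of trials, and the values of t are assumptions made for illustration only.

```python
import numpy as np

def bernstein_tail(t, sigma2, R):
    """Right-hand side of Theorem 1.6.1: 2 exp(-(t^2 / 2) / (sigma2 + R t / 3))."""
    return 2.0 * np.exp(-(t ** 2 / 2.0) / (sigma2 + R * t / 3.0))

# Assumed example: Z is a sum of n independent uniform[-1, 1] variables,
# so each summand has mean 0, satisfies |S_k| <= R = 1, and has variance 1/3.
rng = np.random.default_rng(0)
n, trials = 100, 20_000
R, sigma2 = 1.0, n / 3.0
Z = rng.uniform(-1.0, 1.0, size=(trials, n)).sum(axis=1)

for t in (5.0, 10.0, 15.0, 20.0):
    empirical = np.mean(np.abs(Z) >= t)
    bound = bernstein_tail(t, sigma2, R)
    print(f"t = {t:5.1f}   empirical tail = {empirical:.5f}   Bernstein bound = {bound:.5f}")
```

In this example the crossover point between the two regimes is $\sigma^2 / R = n/3$, so all of the values of t above fall in the moderate-deviation range where the bound behaves like a normal tail.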

1.6.2  The  Matrix  Bernstein  Inequality

What  is  truly  astonishing  is  that  the  scalar  Bernstein  inequality,  Theorem  1.6.1,  lifts  directly  to 
matrices.  Let  us  emphasize  this  remarkable  fact: 

There  are  exponential  concentration  inequalities  for  the  spectral  norm  of  a  sum 
of  independent  random  matrices. 

As  a  consequence,  once  we  decompose  a  random  matrix  as  an  independent  sum,  we  can  harness 
global  properties  (such  as  the  mean  and  the  variance)  and  local  properties  (such  as  a  uniform 
bound  on  the  summands)  to  obtain  detailed  information  about  the  norm  of  the  sum.  As  in  the 
scalar  case,  it  is  usually  easy  to  acquire  the  input  data  for  the  inequality.  But  the  output  of  the 
inequality  is  highly  nontrivial. 

To illustrate this point, we state one of the major results from these notes. This theorem is
a matrix extension of Bernstein’s inequality that was developed independently in the two
papers [Oli10a, Tro11d]. After presenting the result, we give some more details about its
interpretation. In the next section, we apply this result to study the covariance estimation problem.

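Theorem 1.6.2 (Matrix Bernstein). Let $S_1, \dots, S_n$ be independent random matrices with common
dimension $d_1 \times d_2$. Assume that each matrix has bounded deviation from its mean:

$$\|S_k - \mathbb{E}\, S_k\| \le R \quad \text{for each } k = 1, \dots, n.$$

Form the sum $Z = \sum_{k=1}^{n} S_k$, and introduce a variance parameter

$$\sigma^2 = \max\left\{ \|\mathbb{E}[(Z - \mathbb{E}\, Z)(Z - \mathbb{E}\, Z)^*]\|, \ \|\mathbb{E}[(Z - \mathbb{E}\, Z)^*(Z - \mathbb{E}\, Z)]\| \right\}.$$

Then

$$\mathbb{P}\{ \|Z - \mathbb{E}\, Z\| \ge t \} \le (d_1 + d_2) \cdot \exp\left( \frac{-t^2/2}{\sigma^2 + Rt/3} \right) \quad \text{for all } t \ge 0.$$

Furthermore,

$$\mathbb{E}\, \|Z - \mathbb{E}\, Z\| \le \sqrt{2 \sigma^2 \log(d_1 + d_2)} + \frac{1}{3} R \log(d_1 + d_2).$$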

The  proof  of  this  result  appears  in  Chapter  6. 

To  appreciate  what  Theorem  1.6.2  means,  it  is  valuable  to  make  a  direct  comparison  with  the 
scalar  version,  Theorem  1.6.1.  In  both  cases,  we  express  the  object  of  interest  as  an  independent 
sum,  and  we  instate  a  uniform  bound  on  the  summands.  There  are  three  salient  changes: 
•  The variance parameter $\sigma^2$ in the result for matrices can be interpreted as the magnitude of
the expected squared deviation from the mean. The formula reflects the fact that a matrix
$B$ has two different squares $BB^*$ and $B^*B$.

•  The tail bound has a dimensional factor $d_1 + d_2$ that depends on the size of the matrix. This
factor reduces to two in the scalar setting. In the matrix case, it limits the range of $t$ where
the tail bound is informative.

•  We have included a bound for the expected deviation $\mathbb{E}\, \|Z - \mathbb{E}\, Z\|$. This estimate is not
particularly interesting in the scalar setting, but it is usually quite challenging to prove results of
this type for matrices. In fact, we often find the expectation bound more useful than the
tail bound.

For  further  discussion  of  this  result,  turn  to  Chapter  6.  Chapters  4  and  7  contain  related  results 
and  interpretations. 
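The effect of the dimensional factor can be checked numerically. The sketch below evaluates both bounds from Theorem 1.6.2 and also computes the value of t at which the tail bound equals one, obtained by solving $t^2/2 = (\sigma^2 + Rt/3) \log(d_1 + d_2)$; below that value the tail bound is vacuous. The helper names and the inputs `sigma2 = 1.0`, `R = 0.1` are hypothetical, chosen only for illustration.

```python
import numpy as np

def matrix_bernstein_tail(t, sigma2, R, d1, d2):
    """Tail bound of Theorem 1.6.2 for P{ ||Z - EZ|| >= t }."""
    return (d1 + d2) * np.exp(-(t ** 2 / 2.0) / (sigma2 + R * t / 3.0))

def matrix_bernstein_mean(sigma2, R, d1, d2):
    """Expectation bound of Theorem 1.6.2 for E ||Z - EZ||."""
    log_d = np.log(d1 + d2)
    return np.sqrt(2.0 * sigma2 * log_d) + R * log_d / 3.0

def informative_threshold(sigma2, R, d1, d2):
    """Value of t where the tail bound equals 1; it is below 1 only for larger t."""
    log_d = np.log(d1 + d2)
    a = R * log_d / 3.0
    # Positive root of t^2 - 2 a t - 2 sigma2 log_d = 0.
    return a + np.sqrt(a ** 2 + 2.0 * sigma2 * log_d)

# Hypothetical inputs, chosen only to show how the dimension enters.
sigma2, R = 1.0, 0.1
for d in (1, 10, 1000):
    t0 = informative_threshold(sigma2, R, d, d)
    mean_bound = matrix_bernstein_mean(sigma2, R, d, d)
    print(f"d1 = d2 = {d:4d}   threshold = {t0:.3f}   expectation bound = {mean_bound:.3f}")
```

Both quantities grow only like the square root of the logarithm of the dimension when the variance term dominates, which is why the dimensional factor is usually a mild price to pay.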

1.6.3  Example:  A  Sample  Covariance  Matrix 

The  reader  may  not  yet  perceive  why  abstract  matrix  inequalities,  such  as  Theorem  1.6.2,  deliver 
information  about  random  matrices  that  arise  in  practice.  Our  burden  remains  to  show  that  the 
results  are  worthwhile. 

We will apply the matrix Bernstein inequality, Theorem 1.6.2, to measure how well a
sample covariance matrix approximates the true covariance matrix. As before, let $x$ be a zero-mean
random vector with dimension $p$, and assume that the Euclidean norm of the random vector is
bounded: $\|x\|^2 \le B$. The covariance matrix of the vector is $A = \mathbb{E}(x x^*)$. Suppose we have $n$
independent samples $x_1, \dots, x_n$ with the same distribution as $x$. We can form the sample covariance
matrix

$$Y = \frac{1}{n} \sum_{k=1}^{n} x_k x_k^*.$$

Our goal is to study the spectral-norm distance $\|Y - A\|$ between the sample covariance and the
true covariance.

To that end, let us express the error matrix as a sum of independent random matrices:

$$E = Y - A = \sum_{k=1}^{n} S_k,$$

where $S_k = n^{-1}(x_k x_k^* - A)$ for each index $k$. To apply the matrix concentration inequality, we
must bound the norm of each summand, and we must compute the variance of the matrix $E$.

To obtain the uniform bound, observe that

$$\mathbb{E}\, S_k = 0 \quad \text{and} \quad \|S_k\| \le \frac{2B}{n}.$$

We reach the latter inequality as follows:

$$\|S_k\| = \frac{1}{n} \|x_k x_k^* - \mathbb{E}(x x^*)\| \le \frac{1}{n} \left( \|x_k\|^2 + \mathbb{E}\, \|x\|^2 \right) \le \frac{2B}{n}.$$

The first bound follows from the triangle inequality for the spectral norm and Jensen’s inequality.
The second relies on the uniform bound for the norm of a random vector distributed as $x$.
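Next, we must find a bound for the matrix variance $\sigma^2(E)$. Let us calculate that

$$\begin{aligned}
\mathbb{E}(S_k^2) &= \frac{1}{n^2} \mathbb{E}\left[ (x_k x_k^* - A)^2 \right] = \frac{1}{n^2} \mathbb{E}\left[ \|x_k\|^2 \cdot x_k x_k^* - (x_k x_k^*) A - A (x_k x_k^*) + A^2 \right] \\
&\preccurlyeq \frac{1}{n^2} \left[ B \cdot \mathbb{E}(x_k x_k^*) - A^2 - A^2 + A^2 \right] \preccurlyeq \frac{B}{n^2} \cdot A.
\end{aligned}$$

The expression $H \preccurlyeq M$ means that $M - H$ is positive semidefinite. This argument relies on the
uniform upper bound for the norm of the random vector. From here, we quickly obtain the
variance $\sigma^2(E)$: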


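$$\sigma^2(E) = \|\mathbb{E}(E^2)\| = \left\| \sum_{k=1}^{n} \mathbb{E}(S_k^2) \right\| \le \frac{B}{n} \cdot \|A\|.$$

The second relation depends on the fact that the summands are independent and zero mean.
The inequality is valid because $0 \preccurlyeq H \preccurlyeq M$ implies that the norm of $M$ is at least the norm of $H$.
Now, we may invoke Theorem 1.6.2 with $d_1 = d_2 = p$ to obtain

$$\mathbb{E}\, \|Y - A\| \le \sqrt{\frac{2 B \|A\| \log(2p)}{n}} + \frac{2 B \log(2p)}{3n}.$$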


In other words, the error in approximating the true covariance matrix is not too large when
we have a sufficient number of samples. If we wish to obtain a relative error of $\varepsilon$, where $\varepsilon \in (0, 1]$,
we may take

$$n \ge \mathrm{Const} \cdot \frac{B \log p}{\varepsilon^2 \|A\|}.$$

This selection yields

$$\mathbb{E}\, \|Y - A\| \le \mathrm{Const} \cdot \varepsilon \cdot \|A\|.$$

It is often the case that $B = \mathrm{Const} \cdot p$, so we discover that $n = \mathrm{Const} \cdot \varepsilon^{-2} p \log p$ samples suffice
to estimate the covariance matrix $A$ accurately. This bound is qualitatively sharp for worst-case
distributions.
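A quick simulation can be used to compare the expectation bound from this example with the observed error. In the sketch below, the distribution of x (uniform on the sphere of radius $\sqrt{p}$, which gives $B = p$ and $A = I$), the sizes p and n, and the number of trials are all assumptions made for illustration. The bound is computed from Theorem 1.6.2 with $\sigma^2 \le B \|A\| / n$, $R = 2B/n$, and $d_1 = d_2 = p$.

```python
import numpy as np

def covariance_error_bound(B, norm_A, p, n):
    """Bound on E||Y - A|| from Theorem 1.6.2 with sigma^2 <= B ||A|| / n and R = 2B / n."""
    log_d = np.log(2 * p)
    return np.sqrt(2.0 * B * norm_A * log_d / n) + 2.0 * B * log_d / (3.0 * n)

# Assumed model: x is uniform on the sphere of radius sqrt(p) in R^p,
# so that E x = 0, ||x||^2 = p = B, and A = E(x x^*) = I.
rng = np.random.default_rng(0)
p, n, trials = 20, 2000, 50
A = np.eye(p)
B = float(p)

errors = []
for _ in range(trials):
    G = rng.standard_normal((n, p))
    X = np.sqrt(p) * G / np.linalg.norm(G, axis=1, keepdims=True)  # rows are samples x_k
    Y = X.T @ X / n                                                # sample covariance
    errors.append(np.linalg.norm(Y - A, 2))                        # spectral-norm error

mean_error = float(np.mean(errors))
bound = covariance_error_bound(B, np.linalg.norm(A, 2), p, n)
print(f"average ||Y - A|| over {trials} trials: {mean_error:.4f}")
print(f"matrix Bernstein expectation bound:    {bound:.4f}")
```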


1.6.4  History  of  this  Example

Covariance estimation may be the earliest application of matrix concentration bounds in
random matrix theory. Rudelson [Rud99] showed how to use the noncommutative Khintchine
inequality [LP86, LPP91, Buc01, Buc05] to obtain essentially optimal bounds on the sample
covariance estimator for a bounded random vector. The tutorial [Ver12] of Roman Vershynin provides
an excellent overview of this problem as well as many results and references.

The analysis of the sample covariance matrix here is adapted from the paper [GT11]. It leads
to essentially the same result as Rudelson obtained in [Rud99]. For an analysis of sparse
covariance estimation using matrix concentration inequalities, see the paper [CGT12a] and the
technical report [CGT12b].

1.7  The  Arsenal  of  Results 

The  classical  literature  contains  many  exponential  tail  bounds  for  sums  of  independent  random 
variables. Some of the best known results are the Bernstein inequality and the Chernoff
inequality, but there are many more. It turns out that essentially all of these results admit extensions
that hold for random matrices. These lecture notes focus on some exponential concentration
inequalities for matrices that have already found significant applications.

Matrix  Gaussian  Series.  A  matrix  Gaussian  series  is  a  random  matrix  that  can  be  expressed  as  a 
sum  of  fixed  matrices  weighted  by  independent  standard  normal  random  variables.  This 
formulation includes a surprising number of examples. The most important are
undoubtedly Wigner matrices and rectangular Gaussian matrices. Other interesting cases include
a  Toeplitz  matrix  with  Gaussian  entries.  This  material  appears  in  Chapter  4. 

Matrix  Rademacher  Series.  A  matrix  Rademacher  series  is  a  random  matrix  that  can  be  written 
as a sum of fixed matrices weighted by independent Rademacher random variables.² This
construction  includes  things  like  random  sign  matrices,  as  well  as  a  fixed  matrix  whose 
entries  are  modulated  by  random  signs.  There  are  also  interesting  examples  that  arise  in 
combinatorial  optimization.  We  treat  these  problems  in  Chapter  4. 

Matrix  Chernoff  Bounds.  The  matrix  Chernoff  bounds  apply  to  random  matrices  that  can  be 
decomposed as a sum of independent positive-semidefinite random matrices whose
maximum eigenvalues are subject to a uniform bound. These results are appropriate for
studying the Laplacian matrix of a random graph. They also allow us to obtain information
about  the  norm  of  a  random  submatrix  drawn  from  a  fixed  matrix.  See  Chapter  5. 

Matrix  Bernstein  Bounds.  Matrix  Bernstein  inequalities  concern  random  matrices  that  can  be 
expressed as a sum of independent random matrices that are bounded in norm.
These  results  have  many  applications,  including  the  analysis  of  randomized  algorithms  for 
approximate  matrix  multiplication  and  randomized  algorithms  for  matrix  sparsification. 
Chapter  6  contains  this  material. 

Intrinsic  Dimension  Bounds.  Some  matrix  concentration  inequalities  can  be  improved  when 
the  random  matrix  has  limited  spectral  content  in  most  dimensions.  In  this  situation,  we 
may be able to obtain bounds that do not depend on the ambient dimension. See Chapter 7
for details.
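The series constructions listed above can be sketched numerically. The following is a minimal illustration, assuming NumPy; the coefficient matrices (symmetric unit-matrix pairs) and the names `coeffs`, `Y_gauss`, and `Y_rad` are choices made here for the example, not notation from these notes.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
d = 4

# Fixed symmetric coefficient matrices A_k = E_jk + E_kj for j < k
# (an illustrative choice; any fixed matrices would do).
coeffs = []
for j in range(d):
    for k in range(j + 1, d):
        A = np.zeros((d, d))
        A[j, k] = A[k, j] = 1.0
        coeffs.append(A)

# Matrix Gaussian series: fixed matrices weighted by independent N(0, 1) variables.
gamma = rng.standard_normal(len(coeffs))
Y_gauss = sum(g * A for g, A in zip(gamma, coeffs))

# Matrix Rademacher series: same coefficients, independent +/-1 weights.
eps = rng.choice([-1.0, 1.0], size=len(coeffs))
Y_rad = sum(e * A for e, A in zip(eps, coeffs))
```

With this choice of coefficients, `Y_gauss` is a symmetric Gaussian matrix with zero diagonal and `Y_rad` is a symmetric random sign matrix with zero diagonal.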

The literature describes other exponential matrix inequalities for sums of independent random
matrices. These include a matrix Bennett inequality [Tro11d, §6], matrix Bernstein inequalities
for unbounded random matrices [Tro11d, §6], and a matrix Hoeffding inequality [Tro11d,
§7]. These results extend to give bounds for matrix-valued martingales, such as the matrix Azuma
and McDiarmid inequalities [Tro11d, §7] and the matrix Freedman inequality [Oli10a, Tro11a].

Furthermore, the paper [MJC+12] develops a very different technique that can yield matrix
concentration inequalities for random matrices based on dependent random variables. The results
in this work include several exponential inequalities. This approach also leads to polynomial
concentration inequalities, which can be viewed as a generalization of Chebyshev’s inequality.
See the annotated bibliography for more information.

1.8  These  Lecture  Notes 

These lecture notes are intended for researchers and graduate students in computational mathematics
who want to learn some modern techniques for analyzing random matrices. The preparation
required is minimal. We assume familiarity with calculus, applied linear algebra, the basic
theory of normed spaces, and classical probability theory up through the basic concentration
inequalities (such as Markov and Bernstein).

2 A Rademacher random variable is uniformly distributed on $\{\pm 1\}$.

The  material  here  is  based  primarily  on  the  paper  “User-Friendly  Tail  Bounds  for  Sums  of 
Random Matrices” by the present author [Tro11d]. There are several significant revisions to this
earlier  work: 

Examples and Applications. Many of the papers on matrix concentration give limited information
about how the results can be used to solve problems of interest. A major part of these
notes consists of worked examples and applications that indicate how matrix concentration
inequalities are used in practice.

Expectation  Bounds.  This  work  collects  bounds  for  the  expected  value  of  the  spectral  norm  of 
a  random  matrix  and  bounds  for  the  expectation  of  the  smallest  and  largest  eigenvalues  of 
a  random  symmetric  matrix.  Some  of  these  useful  results  have  appeared  piecemeal  in  the 
literature  [CGT12a,  MJC+12],  but  they  have  not  been  included  in  a  unified  presentation. 

Intrinsic  Dimension  Bounds.  Over  the  last  few  years,  there  have  been  some  refinements  to  the 
basic matrix concentration bounds that improve the dependence on dimension [HKZ12b,
Min11]. We describe a new framework that allows us to prove these results with ease.

Annotated  Bibliography.  We  have  included  a  list  of  the  main  works  on  matrix  concentration, 
including  a  short  summary  of  the  main  contributions  of  these  papers.  We  hope  this  list 
will  be  a  valuable  guide  for  further  reading,  even  though  it  remains  incomplete. 

The  organization  of  the  notes  is  straightforward.  Chapter  2  contains  background  material 
that  is  needed  for  the  proofs.  Chapter  3  describes  the  framework  for  developing  exponential 
concentration  inequalities  for  matrices.  Chapter  4  presents  the  first  set  of  results  and  examples, 
concerning  matrix  Gaussian  and  Rademacher  series.  Chapter  5  introduces  the  matrix  Chernoff 
bounds and their applications, and Chapter 6 expands on our discussion of the matrix Bernstein
inequality. Chapter 7 shows how to sharpen some of the results so that they depend on
an  intrinsic  dimension  parameter.  We  conclude  with  resources  on  matrix  concentration  and  a 
bibliography. 

Since  these  are  lecture  notes,  we  have  not  followed  all  of  the  conventions  for  scholarly  articles 
in  journals.  In  particular,  almost  all  the  citations  appear  in  the  notes  at  the  end  of  each  chapter. 
Our  aim  has  been  to  explain  the  ideas  as  clearly  as  possible,  rather  than  to  interrupt  the  narrative 
with  an  elaborate  genealogy  of  results.  In  the  current  version,  these  notes  are  still  not  as  polished 
and  complete  as  we  might  like,  and  we  intend  to  expand  them  in  future  revisions. 


Matrix Functions and Probability with Matrices


We begin the main development with a short overview of the background material that is required
to understand the proofs and, to a lesser extent, the statements of matrix concentration
inequalities.  We  have  been  careful  to  provide  detailed  cross-references  to  these  foundational 
results,  so  most  readers  will  be  able  to  proceed  directly  to  the  main  theoretical  development  in 
Chapter  3  or  the  discussion  of  specific  random  matrix  inequalities  in  Chapters  4,  5,  and  6. 

Section 2.1 below covers material from matrix theory concerning the behavior of matrix functions.
Section 2.2 reviews some relevant results from probability, especially the parts involving
matrices.

2.1 Matrix Theory Background

Most  of  these  results  are  drawn  from  Bhatia’s  excellent  books  on  matrix  analysis  [Bha97,  Bha07]. 
The  books  [HJ85,  HJ94]  of  Horn  and  Johnson  also  serve  as  good  general  references.  Higham’s 
work  [Hig08]  is  a  generous  source  of  information  about  matrix  functions. 

2.1.1  Conventions 

A  matrix  is  a  finite,  two-dimensional  array  of  complex  numbers.  Many  parts  of  the  discussion  do 
not  depend  on  the  size  of  a  matrix,  so  we  specify  dimensions  only  when  it  matters.  Readers  who 
wish  to  think  about  real-valued  matrices  will  find  that  none  of  the  results  require  any  essential 
modification  in  this  setting. 

2.1.2  Spaces  of  Matrices 

Complex matrices with fixed dimensions form a linear space because we can add them and multiply
them by complex scalars. We write $\mathbb{M}^{d_1 \times d_2}$ for the linear space of $d_1 \times d_2$ matrices. In addition
to the usual linear operations, we can multiply square matrices, so they form an algebra.
We write $\mathbb{M}_d$ for the algebra of $d \times d$ square, complex matrices. The set $\mathbb{H}_d$ consists of Hermitian
matrices with dimension $d$; it is a linear space over the real field. That is, we can add Hermitian
matrices and multiply them by real numbers. Multiplication by a complex scalar is verboten
inside $\mathbb{H}_d$. We rarely require this notation, but it is occasionally important for clarity.

2.1.3  Basic  Matrices 

We  write  0  for  the  zero  matrix  and  I  for  the  identity  matrix.  Occasionally,  we  add  a  subscript  to 
specify the dimension. For instance, $I_d$ is the $d \times d$ identity.

The standard basis for the linear space $\mathbb{M}^{d_1 \times d_2}$ consists of unit matrices. We write $E_{jk}$
for the unit matrix with a one in position $(j, k)$ and zeros elsewhere. We use a related notation
for unit vectors. The symbol $e_k$ denotes a column vector with a one in position $k$ and zeros
elsewhere.  The  dimensions  of  unit  matrices  and  unit  vectors  are  typically  determined  by  the 
context. 

A square matrix that satisfies $QQ^* = I = Q^*Q$ is called unitary. We reserve the symbol $Q$ for a
unitary  matrix.  The  symbol  *  denotes  the  conjugate  transpose. 

Readers  who  prefer  the  real  setting  may  prefer  to  regard  Q  as  an  orthogonal  matrix  and  to 
interpret  *  as  the  (simple)  transpose  operation. 

2.1.4  Hermitian  Matrices  and  Eigenvalues 

A square matrix that satisfies $A = A^*$ is called Hermitian. We adopt Parlett’s convention that bold
Latin and Greek letters that are symmetric around the vertical axis ($A, H, \dots, Y$; $\Delta, \Theta, \dots, \Omega$)
always represent Hermitian matrices.

Each Hermitian matrix $A$ has an eigenvalue decomposition

$$A = Q \Lambda Q^* \quad\text{with $Q$ unitary and $\Lambda$ real diagonal.} \tag{2.1.1}$$

The diagonal entries of $\Lambda$ are called the eigenvalues of $A$. The unitary matrix $Q$ in the eigenvalue
decomposition is not completely determined, but the list of eigenvalues is unique modulo
permutations. The eigenvalues of an Hermitian matrix are often referred to as its spectrum.

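We denote the algebraic minimum and maximum eigenvalues of an Hermitian matrix $A$ by
$\lambda_{\min}(A)$ and $\lambda_{\max}(A)$. The extreme eigenvalue maps are positive homogeneous:

$$\lambda_{\min}(\theta A) = \theta \lambda_{\min}(A) \quad\text{and}\quad \lambda_{\max}(\theta A) = \theta \lambda_{\max}(A) \quad\text{for } \theta \ge 0. \tag{2.1.2}$$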

There  is  an  important  relationship  between  minimum  and  maximum  eigenvalues: 

$$\lambda_{\min}(-A) = -\lambda_{\max}(A). \tag{2.1.3}$$

The  fact  (2.1.3)  warns  us  that  we  must  be  careful  passing  scalars  through  an  eigenvalue  map. 

Readers  who  prefer  the  real  setting  may  read  “symmetric”  in  place  of  “Hermitian.”  In  this 
case,  the  eigenvalue  decomposition  involves  an  orthogonal  matrix  Q.  Note,  however,  that  the 
term  “symmetric”  has  a  different  meaning  in  probability! 
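The facts (2.1.1), (2.1.2), and (2.1.3) are easy to check numerically. The snippet below is a minimal sketch, assuming NumPy; the random test matrix and the value of `theta` are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
G = rng.standard_normal((5, 5)) + 1j * rng.standard_normal((5, 5))
A = (G + G.conj().T) / 2          # Hermitian by construction

# eigh returns real eigenvalues in ascending order and a unitary Q.
lam, Q = np.linalg.eigh(A)
lam_min, lam_max = lam[0], lam[-1]

# Eigenvalue decomposition (2.1.1): A = Q diag(lam) Q*.
A_rebuilt = Q @ np.diag(lam) @ Q.conj().T

# Positive homogeneity (2.1.2), which requires theta >= 0.
theta = 2.5
lam_scaled = np.linalg.eigvalsh(theta * A)

# Sign flip (2.1.3): the minimum eigenvalue of -A is minus the maximum of A.
lam_neg = np.linalg.eigvalsh(-A)
```

Note that a negative scalar does not pass through the eigenvalue maps unchanged: it exchanges the roles of the minimum and the maximum.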

2.1.5  The  Trace  of  a  Square  Matrix 

The trace of a square matrix, denoted by $\operatorname{tr}$, is the sum of its diagonal entries:

$$\operatorname{tr} B = \sum_{j=1}^{d} b_{jj} \quad\text{for a $d \times d$ matrix $B$.}$$

The trace is unitarily invariant:

$$\operatorname{tr} B = \operatorname{tr}(Q^* B Q) \quad\text{for each square matrix $B$ and each unitary $Q$.}$$

In  particular,  the  existence  of  an  eigenvalue  decomposition  (2.1.1)  shows  that  the  trace  of  an 
Hermitian  matrix  equals  the  sum  of  its  eigenvalues.  This  fact  also  holds  true  for  a  general  square 
matrix. 

2.1.6  The  Semidefinite  Order 

An Hermitian matrix $A$ with nonnegative eigenvalues is positive semidefinite. When each eigenvalue
is strictly positive, we say that the matrix $A$ is positive definite. Positive-semidefinite matrices
play a special role in matrix theory, analogous to the role of nonnegative numbers in real
analysis.

The set of positive-semidefinite matrices with size $d$ forms a closed, convex cone in the real-linear
space of Hermitian matrices of dimension $d$. Therefore, we may define the semidefinite
partial order on Hermitian matrices of the same size by the rule

$$A \preccurlyeq H \quad\Longleftrightarrow\quad H - A \text{ is positive semidefinite.}$$

In particular, we write $A \succcurlyeq 0$ to indicate that $A$ is positive semidefinite and $A \succ 0$ to indicate that
$A$ is positive definite. For a diagonal matrix $\Lambda$, the expression $\Lambda \succcurlyeq 0$ means that each entry of $\Lambda$
is nonnegative.

The semidefinite order is preserved by conjugation, a fact whose importance cannot be overstated.

Proposition 2.1.1 (Conjugation Rule). Let $A$ and $H$ be Hermitian matrices of the same size, and
let $B$ be a general matrix with conforming dimensions. Then

$$A \preccurlyeq H \quad\Longrightarrow\quad B A B^* \preccurlyeq B H B^*. \tag{2.1.4}$$

Finally,  we  remark  that  the  trace  of  a  positive-semidefinite  matrix  is  at  least  as  large  as  its 
maximum  eigenvalue: 


$$\lambda_{\max}(A) \le \operatorname{tr} A \quad\text{when $A$ is positive semidefinite.} \tag{2.1.5}$$

This  property  follows  from  the  definition  of  a  positive-semidefinite  matrix  and  the  fact  that  the 
trace  of  A  is  the  sum  of  the  eigenvalues. 
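The semidefinite order, the Conjugation Rule (2.1.4), and the trace bound (2.1.5) can be tested on a small example. This is a minimal sketch, assuming NumPy; the helper `is_psd` and its tolerance are hypothetical conveniences introduced here, and the matrices are random test data.

```python
import numpy as np

def is_psd(M, tol=1e-10):
    """Return True when the Hermitian matrix M has no eigenvalue below -tol."""
    return np.linalg.eigvalsh(M).min() >= -tol

rng = np.random.default_rng(2)
d = 4
G = rng.standard_normal((d, d))
A = (G + G.T) / 2                  # arbitrary symmetric matrix
P = rng.standard_normal((d, d))
H = A + P @ P.T                    # H - A = P P^T is PSD, so A <= H in the order

B = rng.standard_normal((3, d))    # general (rectangular) matrix

order_holds = is_psd(H - A)
# Conjugation Rule (2.1.4): B A B* <= B H B*.
conjugated_order_holds = is_psd(B @ H @ B.T - B @ A @ B.T)

# Trace bound (2.1.5) for the positive-semidefinite matrix S = P P^T.
S = P @ P.T
trace_dominates = np.linalg.eigvalsh(S).max() <= np.trace(S) + 1e-12
```

The conjugating matrix `B` is deliberately rectangular: the rule changes the dimension of the matrices but not the direction of the inequality.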

2.1.7  Standard  Matrix  Functions 

Let  us  describe  the  most  direct  method  for  extending  a  function  on  the  reals  to  a  function  on 
Hermitian  matrices.  The  basic  idea  is  to  apply  the  function  to  each  eigenvalue  of  the  matrix  to 
construct  a  new  matrix. 

Definition 2.1.2 (Standard Matrix Function). Let $f : I \to \mathbb{R}$ where $I$ is an interval of the real line.
Let $A$ be a $d \times d$ Hermitian matrix with eigenvalues in $I$. Define the $d \times d$ Hermitian matrix $f(A)$
via the eigenvalue decomposition of $A$:

$$A = Q \begin{bmatrix} \lambda_1 & & \\ & \ddots & \\ & & \lambda_d \end{bmatrix} Q^* \quad\Longrightarrow\quad f(A) = Q \begin{bmatrix} f(\lambda_1) & & \\ & \ddots & \\ & & f(\lambda_d) \end{bmatrix} Q^*.$$

In particular, we can apply $f$ to a real diagonal matrix by applying the function to each diagonal
entry.

It can be verified that the definition of $f(A)$ does not depend on which eigenvalue decomposition
$A = Q \Lambda Q^*$ we choose. Any matrix function that arises in this fashion is called a standard
matrix function.

For an Hermitian matrix $A$, when we write the power function $A^p$ or the exponential $e^A$ or
the  logarithm  log  A,  we  are  always  referring  to  a  standard  matrix  function.  Note  that  we  only 
define  the  matrix  logarithm  for  positive-definite  matrices,  and  non-integer  powers  are  only  valid 
for  positive-semidefinite  matrices. 

The following result is an immediate, but important, consequence of the definition of a standard
matrix function.

Proposition 2.1.3 (Spectral Mapping Theorem). Let $A$ be an Hermitian matrix, and let $f : \mathbb{R} \to \mathbb{R}$.
Each eigenvalue of $f(A)$ has the form $f(\lambda)$, where $\lambda$ is an eigenvalue of $A$.

In  most  cases,  the  “obvious”  generalization  of  an  inequality  for  real-valued  functions  fails  to 
hold  in  the  semidefinite  order.  Nevertheless,  there  is  one  class  of  inequalities  for  real  functions 
that  extends  to  give  semidefinite  relationships  for  matrix  functions. 

Proposition 2.1.4 (Transfer Rule). Let $f$ and $g$ be real-valued functions defined on an interval $I$
of the real line, and let $A$ be an Hermitian matrix whose eigenvalues are contained in $I$. Then

$$f(a) \le g(a) \text{ for each } a \in I \quad\Longrightarrow\quad f(A) \preccurlyeq g(A). \tag{2.1.6}$$

Proof. Decompose $A = Q \Lambda Q^*$. It is immediate that $f(\Lambda) \preccurlyeq g(\Lambda)$ because both matrices are
diagonal and the scalar inequality holds entry by entry. The Conjugation Rule (2.1.4)
allows us to conjugate this relation by $Q$. Finally, invoke the definition of the matrix function to
complete the argument. □
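Definition 2.1.2, the Spectral Mapping Theorem, and the Transfer Rule translate directly into a few lines of code. The following is a minimal sketch, assuming NumPy; the helper name `herm_fun` is hypothetical, and the scalar inequality $1 + a \le e^a$ is chosen here only as a convenient instance of the Transfer Rule.

```python
import numpy as np

def herm_fun(A, f):
    """Apply a scalar function f to the Hermitian matrix A via Definition 2.1.2."""
    lam, Q = np.linalg.eigh(A)
    # Scaling column j of Q by f(lam[j]) is the same as Q @ diag(f(lam)).
    return (Q * f(lam)) @ Q.conj().T

rng = np.random.default_rng(3)
G = rng.standard_normal((4, 4))
A = (G + G.T) / 2

expA = herm_fun(A, np.exp)

# Spectral Mapping Theorem: the eigenvalues of e^A are exp(eigenvalues of A).
eig_expA = np.linalg.eigvalsh(expA)
mapped = np.sort(np.exp(np.linalg.eigvalsh(A)))

# Transfer Rule with f(a) = 1 + a <= g(a) = e^a on the whole real line:
# the gap e^A - (I + A) should be positive semidefinite.
gap = expA - (np.eye(4) + A)
min_gap_eig = np.linalg.eigvalsh(gap).min()
```

The smallest eigenvalue of `gap` is nonnegative up to rounding error, which is the semidefinite relation $I + A \preccurlyeq e^A$.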

When a real function has a power series expansion, we can also represent the standard matrix
function with the same power series expansion. Indeed, suppose that $f : I \to \mathbb{R}$ is defined on an
interval $I$ of the real line, and assume that $A$ has eigenvalues in $I$. Then

$$f(a) = c_0 + \sum_{p=1}^{\infty} c_p a^p \text{ for } a \in I \quad\Longrightarrow\quad f(A) = c_0 I + \sum_{p=1}^{\infty} c_p A^p.$$

This  formula  can  be  verified  using  an  eigenvalue  decomposition  of  A,  along  with  the  definition 
of  a  standard  matrix  function. 

2.1.8  The  Matrix  Exponential 

For any Hermitian matrix $A$, we can introduce the matrix exponential $e^A$ using Definition 2.1.2
of a standard matrix function. Equivalently, we can use a power series expansion:

$$e^A = \exp(A) = I + \sum_{p=1}^{\infty} \frac{A^p}{p!}.$$

The Spectral Mapping Theorem, Proposition 2.1.3, implies that the exponential of an Hermitian
matrix is always positive definite.

We often work with the trace of the matrix exponential:

$$\operatorname{tr} \exp : A \mapsto \operatorname{tr} e^A.$$

This function has several properties that we use extensively. First, the trace exponential is monotone
with respect to the semidefinite order. That is, for Hermitian matrices $A$ and $H$ of the same
size,

$$A \preccurlyeq H \quad\Longrightarrow\quad \operatorname{tr} e^A \le \operatorname{tr} e^H. \tag{2.1.7}$$

The trace exponential is also a convex function on the real-linear space of Hermitian matrices.
That is, for Hermitian matrices $A$ and $H$ of the same size,

$$\operatorname{tr} e^{\tau A + \bar{\tau} H} \le \tau \operatorname{tr} e^A + \bar{\tau} \operatorname{tr} e^H \quad\text{where } \tau \in [0, 1] \text{ and } \bar{\tau} = 1 - \tau.$$

In other words, the trace exponential of an average is no greater than the average value of the
trace exponentials. The proofs of these two results are not particularly hard, but they fall outside
the boundary of these notes. See the survey article [Pet94, Sec. 2] or the lecture notes [Car10,
Sec. 2.2] for a complete demonstration.
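Both properties of the trace exponential can be spot-checked numerically. The snippet below is a minimal sketch, assuming NumPy; the helper `trexp`, the test matrices, and the grid of weights are illustrative choices, and a passing check on one example is of course not a proof.

```python
import numpy as np

def trexp(A):
    """Trace of the matrix exponential of a Hermitian matrix A."""
    # By the Spectral Mapping Theorem, tr e^A is the sum of exp(eigenvalues).
    return np.exp(np.linalg.eigvalsh(A)).sum()

rng = np.random.default_rng(4)
d = 4
G = rng.standard_normal((d, d))
A = (G + G.T) / 2
P = rng.standard_normal((d, d))
H = A + P @ P.T                    # guarantees A <= H in the semidefinite order

# Monotonicity (2.1.7).
monotone = trexp(A) <= trexp(H)

# Convexity along the segment between A and a second symmetric matrix K.
G2 = rng.standard_normal((d, d))
K = (G2 + G2.T) / 2
taus = np.linspace(0.0, 1.0, 11)
convex = all(
    trexp(t * A + (1 - t) * K) <= t * trexp(A) + (1 - t) * trexp(K) + 1e-9
    for t in taus
)
```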

2.1.9  The  Matrix  Logarithm 

We  can  define  the  matrix  logarithm  as  a  standard  matrix  function.  The  matrix  logarithm  is  also 
the  functional  inverse  of  the  matrix  exponential: 

$$\log(e^A) = A \quad\text{for each Hermitian matrix $A$.} \tag{2.1.8}$$

A deep and significant fact about the matrix logarithm is that it preserves the semidefinite order.
For positive-definite matrices $A$ and $H$ of the same size,

$$0 \prec A \preccurlyeq H \quad\Longrightarrow\quad \log(A) \preccurlyeq \log(H). \tag{2.1.9}$$

For a good treatment of operator monotonicity at an introductory level, see [Bha97, Chap. V].
Let us emphasize that the matrix exponential does not have any operator monotonicity property
analogous to (2.1.9)!
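A concrete $2 \times 2$ pair makes the contrast between the logarithm and the exponential visible. The following is a minimal sketch, assuming NumPy; the specific matrices and the helper names `herm_fun` and `min_eig` are choices made here for illustration.

```python
import numpy as np

def herm_fun(A, f):
    """Standard matrix function of a Hermitian matrix (Definition 2.1.2)."""
    lam, Q = np.linalg.eigh(A)
    return (Q * f(lam)) @ Q.conj().T

def min_eig(M):
    """Smallest eigenvalue of a Hermitian matrix."""
    return np.linalg.eigvalsh(M).min()

A = np.array([[2.0, 2.0], [2.0, 2.0]])
H = np.array([[4.0, 2.0], [2.0, 2.0]])   # H - A = diag(2, 0) is PSD, so A <= H

# The exponential fails to preserve the order for this pair:
# e^H - e^A has a negative eigenvalue.
exp_gap = min_eig(herm_fun(H, np.exp) - herm_fun(A, np.exp))

# The logarithm needs positive-definite inputs, so shift both by the identity.
A_pd, H_pd = A + np.eye(2), H + np.eye(2)
log_gap = min_eig(herm_fun(H_pd, np.log) - herm_fun(A_pd, np.log))
```

Here `exp_gap` is strictly negative, so $e^A \preccurlyeq e^H$ fails even though $A \preccurlyeq H$, while `log_gap` is nonnegative up to rounding error, in agreement with (2.1.9).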

2.1.10  Singular  Values  of  General  Matrices 

A general matrix $B$ does not have an eigenvalue decomposition, but it admits a different representation
that is just as useful. Every $d_1 \times d_2$ matrix $B$ has a singular value decomposition

$$B = U \Sigma V^* \quad\text{with $U$, $V$ unitary and $\Sigma$ nonnegative diagonal.} \tag{2.1.10}$$

The unitary matrices $U$ and $V$ have dimensions $d_1 \times d_1$ and $d_2 \times d_2$, respectively. The inner matrix
$\Sigma$ has dimension $d_1 \times d_2$, and we use the term diagonal in the sense that only the diagonal entries
$(\Sigma)_{jj}$ may be nonzero.

The diagonal entries of $\Sigma$ are called the singular values of $B$. They are determined completely
modulo permutations, and it is standard to arrange them in weakly decreasing order:

$$\sigma_1(B) \ge \sigma_2(B) \ge \dots \ge \sigma_{\min\{d_1, d_2\}}(B).$$

There is an important relationship between singular values and eigenvalues. A general matrix
has two squares associated with it, $BB^*$ and $B^*B$, both of which are Hermitian. We can use a
singular value decomposition of $B$ to construct eigenvalue decompositions of the two squares:

$$BB^* = U (\Sigma \Sigma^*) U^* \quad\text{and}\quad B^*B = V (\Sigma^* \Sigma) V^*.$$

The two squares of $\Sigma$ are both nonnegative, diagonal, and, of course, square. Conversely, we
can always extract a singular value decomposition from eigenvalue decompositions of the two
squares.


2.1.11 The Spectral Norm and the Euclidean Norm

The spectral norm of an Hermitian matrix is defined by the relation

$$\|A\| = \max\{\lambda_{\max}(A), -\lambda_{\min}(A)\}.$$

For a general matrix $B$, the spectral norm is defined to be the largest singular value:

$$\|B\| = \sigma_1(B).$$

These two definitions are consistent for Hermitian matrices.

When applied to a row vector or a column vector, the spectral norm coincides with the Euclidean
norm:

$$\|b\| = \Big( \sum_{j=1}^{d} |b_j|^2 \Big)^{1/2} \quad\text{for } b \in \mathbb{C}^d.$$

We  are  certainly  justified,  therefore,  in  using  the  same  symbol  for  both  norms. 
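The relationships among singular values, the two squares, and the norms can be confirmed on a small rectangular matrix. This is a minimal sketch, assuming NumPy; the test matrices are random and the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)
B = rng.standard_normal((3, 5)) + 1j * rng.standard_normal((3, 5))

# Singular values are returned in weakly decreasing order.
sigma = np.linalg.svd(B, compute_uv=False)

# The squared singular values are the eigenvalues of the square B B*.
eig_BBh = np.linalg.eigvalsh(B @ B.conj().T)[::-1]   # reversed to decreasing order

# Spectral norm of a general matrix: the largest singular value.
spec_norm = np.linalg.norm(B, 2)

# Spectral norm of a Hermitian matrix: max{lambda_max, -lambda_min}.
A = B @ B.conj().T - 2.0 * np.eye(3)                 # a Hermitian test matrix
lam = np.linalg.eigvalsh(A)
herm_norm = max(lam[-1], -lam[0])

# For a (column) vector, the spectral norm is the Euclidean norm.
b = B[0, :]
vec_norm = np.linalg.norm(b.reshape(-1, 1), 2)
```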


2.1.12  Dilations 


An extraordinarily fruitful idea from operator theory is to embed matrices within larger block
matrices, called dilations [Pau02].

Definition 2.1.5 (Hermitian Dilation). The Hermitian dilation

$$\mathscr{H} : \mathbb{M}^{d_1 \times d_2} \to \mathbb{H}_{d_1 + d_2}$$

is the map from a general matrix to a Hermitian matrix given by

$$\mathscr{H}(B) = \begin{bmatrix} 0 & B \\ B^* & 0 \end{bmatrix}. \tag{2.1.11}$$


The dilation retains important spectral information. To see why, note that the square of the
dilation satisfies

$$\mathscr{H}(B)^2 = \begin{bmatrix} BB^* & 0 \\ 0 & B^*B \end{bmatrix}. \tag{2.1.12}$$

We discover that the squared eigenvalues of $\mathscr{H}(B)$ coincide with the squared singular values of
$B$, along with an appropriate number of zeros. Since the trace of $\mathscr{H}(B)$ is zero, its maximum
eigenvalue must be nonnegative. Together, these two facts yield an important identity:

$$\lambda_{\max}(\mathscr{H}(B)) = \|\mathscr{H}(B)\| = \|B\|. \tag{2.1.13}$$

Finally, we note that the Hermitian dilation is a real-linear map.
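The Hermitian dilation is simple to build with block matrices, and the identity (2.1.13) can then be checked directly. The following is a minimal sketch, assuming NumPy; the function name `hermitian_dilation` and the random test matrix are choices made here.

```python
import numpy as np

def hermitian_dilation(B):
    """Return the (d1 + d2) x (d1 + d2) block matrix [[0, B], [B*, 0]]."""
    d1, d2 = B.shape
    return np.block([
        [np.zeros((d1, d1)), B],
        [B.conj().T, np.zeros((d2, d2))],
    ])

rng = np.random.default_rng(6)
B = rng.standard_normal((2, 3)) + 1j * rng.standard_normal((2, 3))
HB = hermitian_dilation(B)

eigs = np.linalg.eigvalsh(HB)       # real eigenvalues in ascending order
lam_max = eigs[-1]
spec_norm_B = np.linalg.norm(B, 2)  # largest singular value of B

# The square of the dilation is block diagonal, as in (2.1.12).
HB_squared = HB @ HB
```

This construction is what lets results about the maximum eigenvalue of a Hermitian matrix carry over to the spectral norm of a rectangular matrix.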
2.2 Probability Background

We  continue  with  some  material  from  probability,  focusing  on  connections  with  matrices.  For 
more  details,  consult  any  good  probability  text. 


2.2.1  Conventions 

We prefer to avoid abstraction and unnecessary technical detail, so we frame the standing assumption
that all random variables are sufficiently regular that we are justified in computing
expectations,  interchanging  limits,  and  so  forth.  All  the  manipulations  we  perform  are  valid  if 
we  assume  that  all  random  variables  are  bounded,  but  the  results  hold  in  broader  circumstances 
if  we  instate  appropriate  regularity  conditions. 


2.2.2  Random  Matrices 

Let $(\Omega, \mathscr{F}, \mathbb{P})$ be a probability space, and let $\mathbb{M}^{d_1 \times d_2}$ be the set of $d_1 \times d_2$ complex matrices. A
random matrix $Z$ is a measurable map

$$Z : \Omega \to \mathbb{M}^{d_1 \times d_2}.$$

It is more natural to think of the entries of $Z$ as complex random variables that may or may not
be correlated with each other. We reserve the letters $X$, $Y$ for random Hermitian matrices, and
the letter $Z$ denotes a general random matrix.

A finite sequence $\{Z_k\}$ of random matrices is independent when

$$\mathbb{P}\{Z_k \in E_k \text{ for each } k\} = \prod\nolimits_k \mathbb{P}\{Z_k \in E_k\}$$

for every collection $\{E_k\}$ of Borel subsets of $\mathbb{M}^{d_1 \times d_2}$.


2.2.3  Expectation 

The expectation of a random matrix $Z = [Z_{jk}]$ is simply the matrix formed by taking the componentwise
expectation. That is,

$$(\mathbb{E} Z)_{jk} = \mathbb{E}(Z_{jk}).$$

Under mild assumptions, expectation commutes with linear and real-linear maps. Indeed, expectation
commutes with multiplication by a fixed matrix:

$$\mathbb{E}(BZ) = B(\mathbb{E} Z) \quad\text{and}\quad \mathbb{E}(ZB) = (\mathbb{E} Z)B.$$

In particular, the product rule for the expectation of independent random variables extends to
matrices:

$$\mathbb{E}(SZ) = (\mathbb{E} S)(\mathbb{E} Z) \quad\text{when $S$ and $Z$ are independent.}$$

We use these identities liberally, without any further comment.

2.2.4 Inequalities for Expectation

Markov’s inequality states that a nonnegative (real) random variable $X$ obeys the probability
bound

$$\mathbb{P}\{X \ge t\} \le \frac{\mathbb{E} X}{t} \quad\text{for } t > 0. \tag{2.2.1}$$

The  Markov  inequality  is  a  central  tool  for  establishing  concentration  inequalities. 

Jensen’s inequality describes how averaging interacts with convexity. Let $Z$ be a random matrix,
and let $h$ be a real-valued function on matrices. Then

$$\mathbb{E}\, h(Z) \le h(\mathbb{E} Z) \quad\text{when $h$ is concave, and}$$
$$h(\mathbb{E} Z) \le \mathbb{E}\, h(Z) \quad\text{when $h$ is convex.}$$

Let us emphasize that these inequalities hold for every real-valued function $h$ on matrices that is
concave or convex.

The  expectation  of  a  random  matrix  can  be  viewed  as  a  convex  combination,  and  the  cone 
of  positive-semidefinite  matrices  is  convex.  Therefore,  expectation  preserves  the  semidefinite 
order: 

$$X \preccurlyeq Y \quad\Longrightarrow\quad \mathbb{E} X \preccurlyeq \mathbb{E} Y.$$

We  use  this  result  many  times  without  direct  reference. 


The Matrix Laplace Transform Method


This  chapter  contains  the  core  part  of  the  analysis  that  ultimately  delivers  matrix  concentration 
inequalities.  Readers  who  are  only  interested  in  the  concentration  inequalities  themselves  or  the 
sample  applications  may  wish  to  move  on  to  Chapters  4,  5,  and  6. 

The  approach  that  we  take  can  be  viewed  as  a  matrix  extension  of  the  Laplace  transform 
method,  sometimes  referred  to  as  the  “Bernstein  trick.”  In  the  scalar  setting,  this  so-called  trick 
is one of the most basic and successful paths to reach concentration inequalities for sums of independent
random variables. It turns out that there is a very satisfactory version of this argument
that  applies  to  sums  of  independent  random  matrices.  In  the  more  general  setting,  however,  we 
must  invest  more  care  and  wield  sharper  tools  to  execute  this  technique  successfully. 

We first define matrix analogs of the moment generating function and the cumulant generating
function, which pack up information about the growth of a random matrix. Section 3.2 explains
how we can use the matrix mgf to obtain probability inequalities for the maximum eigenvalue
of a random Hermitian matrix. The next task is to develop a bound for the mgf of a sum
of independent random matrices using information about the summands. In §3.3, we discuss
the challenges that arise, and §3.4 presents the ideas we need to overcome these obstacles. Section 3.5
establishes that the classical result on additivity of cumulants has a companion in the
matrix setting. This result allows us to develop a collection of abstract probability inequalities
in §3.6 that we specialize to obtain matrix Chernoff bounds, matrix Bernstein bounds, etc.


3.1 Matrix Moments and Cumulants

At the heart of the Laplace transform method are the moment generating function (mgf) and
the cumulant generating function (cgf) of a random variable. We begin by presenting matrix
versions of the mgf and cgf.

Definition 3.1.1 (Matrix Mgf and Cgf). Let $X$ be a random Hermitian matrix. The matrix moment
generating function $M_X$ and the matrix cumulant generating function $\Xi_X$ are given by

$$M_X(\theta) := \mathbb{E}\, e^{\theta X} \quad\text{and}\quad \Xi_X(\theta) := \log \mathbb{E}\, e^{\theta X} \quad\text{for } \theta \in \mathbb{R}. \tag{3.1.1}$$

Note that the expectations may not exist for all values of $\theta$.

The matrix mgf $M_X$ and matrix cgf $\Xi_X$ contain information about how much the random matrix
$X$ varies. We aim to exploit the data encoded in these functions to control the eigenvalues.

To expand on Definition 3.1.1, let us observe that the matrix mgf and cgf have formal power
series expansions:

$$M_X(\theta) = I + \sum_{p=1}^{\infty} \frac{\theta^p}{p!} \, \mathbb{E}(X^p) \quad\text{and}\quad \Xi_X(\theta) = \sum_{p=1}^{\infty} \frac{\theta^p}{p!} \, \Psi_p.$$

We call the coefficients $\mathbb{E}(X^p)$ matrix moments, and we refer to $\Psi_p$ as a matrix cumulant. The
matrix cumulant $\Psi_p$ has a formal expression as a (noncommutative) polynomial in the matrix
moments up to order $p$. In particular, the first cumulant is the mean and the second cumulant
is the variance:

$$\Psi_1 = \mathbb{E} X \quad\text{and}\quad \Psi_2 = \mathbb{E}(X^2) - (\mathbb{E} X)^2.$$

Higher-order cumulants are harder to write down and interpret.

3.2  The  Matrix  Laplace  Transform  Method 

In  the  scalar  setting,  the  Laplace  transform  method  allows  us  to  obtain  tail  bounds  for  a  random 
variable  in  terms  of  its  mgf.  The  starting  point  for  our  theory  is  the  observation  that  a  similar 
result  holds  in  the  matrix  setting. 

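Proposition 3.2.1 (Tail Bounds for Eigenvalues). Let $Y$ be a random Hermitian matrix. For all
$t \in \mathbb{R}$,

$$\mathbb{P}\{\lambda_{\max}(Y) \ge t\} \le \inf_{\theta > 0} e^{-\theta t} \, \mathbb{E} \operatorname{tr} e^{\theta Y}, \quad\text{and} \tag{3.2.1}$$
$$\mathbb{P}\{\lambda_{\min}(Y) \le t\} \le \inf_{\theta < 0} e^{-\theta t} \, \mathbb{E} \operatorname{tr} e^{\theta Y}. \tag{3.2.2}$$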

In  words,  we  can  control  the  tail  probabilities  of  the  extreme  eigenvalues  of  a  random  matrix 
by  producing  a  bound  for  the  trace  of  the  matrix  mgf.  The  proof  of  this  fact  parallels  the  classical 
argument,  but  there  is  a  twist. 

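Proof. We begin with (3.2.1). Fix a positive number $\theta$, and observe that

$$\mathbb{P}\{\lambda_{\max}(Y) \ge t\} = \mathbb{P}\{e^{\theta \lambda_{\max}(Y)} \ge e^{\theta t}\} \le e^{-\theta t} \, \mathbb{E}\, e^{\theta \lambda_{\max}(Y)}.$$

The first identity holds because $a \mapsto e^{\theta a}$ is a monotone increasing function, so the event doesn’t
change under the mapping. The second relation is Markov’s inequality (2.2.1). To control the
exponential, note that

$$e^{\theta \lambda_{\max}(Y)} = e^{\lambda_{\max}(\theta Y)} = \lambda_{\max}(e^{\theta Y}) \le \operatorname{tr} e^{\theta Y}. \tag{3.2.3}$$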



The first identity holds because the maximum eigenvalue map is positive homogeneous, as stated
in (2.1.2). The second depends on the Spectral Mapping Theorem, Proposition 2.1.3. The inequality
holds because the exponential of an Hermitian matrix is positive definite, and (2.1.5)
shows that the maximum eigenvalue of a positive-definite matrix is dominated by the trace.
Combine the latter two relations to reach

$$\mathbb{P}\{\lambda_{\max}(Y) \ge t\} \le e^{-\theta t} \, \mathbb{E} \operatorname{tr} e^{\theta Y}.$$

This inequality holds for any positive $\theta$, so we may take an infimum to achieve the tightest possible
bound.

To prove (3.2.2), we use a similar approach. Fix a negative number $\theta$, and calculate that

$$\mathbb{P}\{\lambda_{\min}(Y) \le t\} = \mathbb{P}\{e^{\theta \lambda_{\min}(Y)} \ge e^{\theta t}\} \le e^{-\theta t} \, \mathbb{E}\, e^{\theta \lambda_{\min}(Y)} = e^{-\theta t} \, \mathbb{E}\, e^{\lambda_{\max}(\theta Y)}.$$

The function $a \mapsto e^{\theta a}$ reverses the inequality in the event because it is monotone decreasing.
The third relation owes to the relationship (2.1.3) between minimum and maximum eigenvalues.
Finally, introduce the inequality (3.2.3) for the trace exponential and minimize over negative
$\theta$. □
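To see the bound (3.2.1) in action, we can compare it with an exact tail probability for a random matrix whose distribution can be enumerated. The following is a minimal sketch, assuming NumPy; the small Rademacher series, the grid of $\theta$ values, the threshold `t`, and the helper `laplace_bound` are all choices made here for illustration. Restricting the infimum to a finite grid gives a weaker bound, but one that is still valid.

```python
import itertools
import numpy as np

rng = np.random.default_rng(7)
d, n = 3, 6

# Fixed symmetric coefficients for a small Rademacher series Y = sum_k eps_k A_k.
coeffs = []
for _ in range(n):
    G = rng.standard_normal((d, d))
    coeffs.append((G + G.T) / 2)

# Enumerate all 2^n equally likely sign patterns, so every expectation is exact.
eig_lists = []
for signs in itertools.product([-1.0, 1.0], repeat=n):
    Y = sum(s * A for s, A in zip(signs, coeffs))
    eig_lists.append(np.linalg.eigvalsh(Y))
eig_lists = np.array(eig_lists)          # shape (2^n, d), ascending in each row
lam_max_vals = eig_lists[:, -1]

def laplace_bound(t, thetas):
    """Smallest value of e^{-theta t} E tr e^{theta Y} over a grid of theta > 0."""
    # tr e^{theta Y} is the sum of exp(theta * eigenvalue) over the spectrum.
    mgf_trace = np.array(
        [np.exp(th * eig_lists).sum(axis=1).mean() for th in thetas]
    )
    return np.min(np.exp(-thetas * t) * mgf_trace)

thetas = np.linspace(0.05, 5.0, 100)
t = 0.8 * lam_max_vals.max()             # a threshold in the upper tail
tail_prob = np.mean(lam_max_vals >= t)   # exact P{lambda_max(Y) >= t}
bound = laplace_bound(t, thetas)
```

The exact probability `tail_prob` never exceeds `bound`, since the inequality holds for each positive $\theta$ separately.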

In  the  proof  of  Proposition  3.2.1,  it  may  seem  crude  to  bound  the  maximum  eigenvalue  by 
the  trace.  It  turns  out  that,  at  most,  this  estimate  results  in  a  loss  of  a  logarithmic  factor.  At  the 
same  time,  the  maneuver  allows  us  to  exploit  some  amazing  convexity  properties  of  the  trace 
exponential. 

We can adapt the proof of Proposition 3.2.1 to obtain bounds for the expectation of the maximum
eigenvalue of a random Hermitian matrix. This argument does not have a perfect analog
in  the  scalar  setting. 

Proposition 3.2.2 (Expectation Bounds for Eigenvalues). Let $Y$ be a random Hermitian matrix.
Then

$$\mathbb{E}\, \lambda_{\max}(Y) \le \inf_{\theta > 0} \frac{1}{\theta} \log \mathbb{E} \operatorname{tr} e^{\theta Y}, \quad\text{and} \tag{3.2.4}$$
$$\mathbb{E}\, \lambda_{\min}(Y) \ge \sup_{\theta < 0} \frac{1}{\theta} \log \mathbb{E} \operatorname{tr} e^{\theta Y}. \tag{3.2.5}$$

Proof. We establish the bound (3.2.4); the proof of (3.2.5) is quite similar. Fix a positive number
$\theta$, and calculate that

$$\mathbb{E}\, \lambda_{\max}(Y) = \frac{1}{\theta} \mathbb{E}\, \lambda_{\max}(\theta Y) = \frac{1}{\theta} \mathbb{E} \log \exp \lambda_{\max}(\theta Y) = \frac{1}{\theta} \mathbb{E} \log \lambda_{\max}(e^{\theta Y}) \le \frac{1}{\theta} \mathbb{E} \log \operatorname{tr} e^{\theta Y} \le \frac{1}{\theta} \log \mathbb{E} \operatorname{tr} e^{\theta Y}.$$

The first identity holds because the maximum eigenvalue map is positive homogeneous, as stated
in (2.1.2). The third follows when we use the Spectral Mapping Theorem, Proposition 2.1.3, to
draw the exponential inside the eigenvalue map. The first inequality depends on the fact (2.1.5) that
the trace of a positive-definite matrix dominates the maximum eigenvalue. The second inequality
is Jensen’s inequality, applied to the concave logarithm. □

3.3 The Failure of the Matrix Mgf

We would like to use the Laplace transform bounds from Section 3.2 to study a sum of independent
random matrices. In the scalar setting, the Laplace transform method is effective for
studying independent sums because the mgf and the cgf decompose. In the matrix case, the
situation is more subtle, and the goal of this section is to indicate where things go awry.

Consider an independent sequence $\{X_k\}$ of real random variables. The mgf of the sum satisfies
a multiplication rule:

$$ M_{(\sum_k X_k)}(\theta) = \mathbb{E} \exp\Big(\sum_k \theta X_k\Big) = \mathbb{E} \prod_k e^{\theta X_k} = \prod_k \mathbb{E}\, e^{\theta X_k} = \prod_k M_{X_k}(\theta). \tag{3.3.1} $$

At first, we might imagine that a similar relationship holds for the matrix mgf. Consider an
independent sequence $\{X_k\}$ of random Hermitian matrices. Perhaps,

$$ M_{(\sum_k X_k)}(\theta) = \prod_k M_{X_k}(\theta). \tag{3.3.2} $$

Unfortunately,  this  hope  shatters  when  we  subject  it  to  interrogation. 

It  is  not  hard  to  find  the  reason  that  (3.3.2)  fails.  Note  that  the  identity  (3.3.1)  depends  on 
the  fact  that  the  scalar  exponential  converts  a  sum  into  a  product.  In  contrast,  for  Hermitian 
matrices, 

$$ e^{A+H} \ne e^{A} e^{H} \quad\text{unless $A$ and $H$ commute.} $$

If we introduce the trace, the situation improves somewhat:

$$ \operatorname{tr} e^{A+H} \le \operatorname{tr} e^{A} e^{H} \quad\text{for all Hermitian $A$, $H$.} \tag{3.3.3} $$

The result (3.3.3) is known as the Golden-Thompson inequality, a famous theorem from statistical
physics. Unfortunately, the analogous bound may fail for three matrices:

$$ \operatorname{tr} e^{A+H+M} \not\le \operatorname{tr} e^{A} e^{H} e^{M} \quad\text{for certain Hermitian $A$, $H$, $M$.} $$

It  seems  that  we  have  reached  an  impasse. 
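
The two-matrix statements above are easy to probe numerically. The following is a minimal sketch, not part of the original notes: it assumes NumPy, and the helper names `expm_hermitian` and `random_hermitian` are hypothetical. It draws two random Hermitian matrices, confirms that the exponential of the sum differs from the product of the exponentials, and checks the Golden-Thompson inequality (3.3.3) for that one pair.

```python
import numpy as np

def expm_hermitian(M):
    """Matrix exponential of a Hermitian matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(M)
    # V @ diag(exp(w)) @ V^*; broadcasting scales the columns of V.
    return (V * np.exp(w)) @ V.conj().T

def random_hermitian(d, rng):
    """A random d x d Hermitian matrix (illustrative helper, not from the notes)."""
    G = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    return (G + G.conj().T) / 2

rng = np.random.default_rng(0)
A = random_hermitian(4, rng)
H = random_hermitian(4, rng)

exp_sum = expm_hermitian(A + H)
exp_prod = expm_hermitian(A) @ expm_hermitian(H)

# Size of the failure of exp(A + H) = exp(A) exp(H) for non-commuting A, H.
gap = np.linalg.norm(exp_sum - exp_prod)

# Golden-Thompson: tr exp(A + H) <= tr exp(A) exp(H).
lhs = np.trace(exp_sum).real
rhs = np.trace(exp_prod).real
print("product rule gap:", gap)
print("Golden-Thompson holds:", lhs <= rhs)
```

A single random pair proves nothing, of course; the point is only to make the contrast between the matrix identity (which fails) and the trace inequality (which holds) concrete.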

What if we consider the cgf instead? The cgf of a sum of independent random variables satisfies
an addition rule:

$$ \Xi_{(\sum_k X_k)}(\theta) = \log \mathbb{E} \exp\Big(\sum_k \theta X_k\Big) = \log \prod_k \mathbb{E}\, e^{\theta X_k} = \sum_k \Xi_{X_k}(\theta). \tag{3.3.4} $$

The relation (3.3.4) follows when we extract the logarithm of the multiplication rule (3.3.1). This
result looks like a more promising candidate for generalization because a sum of Hermitian matrices
remains Hermitian. We might hope that

$$ \Xi_{(\sum_k X_k)}(\theta) = \sum_k \Xi_{X_k}(\theta). $$

As stated, this putative identity also fails. Nevertheless, the addition rule (3.3.4) admits a very
satisfactory extension to matrices. In contrast with the scalar case, the proof involves much deeper
considerations.

3.4  A  Theorem  of  Lieb 

To find the appropriate generalization of the addition rule for cgfs, we turn to the literature on
matrix analysis. Here, we discover a famous result of Elliott Lieb on the convexity properties of
the trace exponential function.

Theorem 3.4.1 (Lieb). Fix a Hermitian matrix $H$ with dimension $d$. The function

$$ A \mapsto \operatorname{tr} \exp\big(H + \log(A)\big) $$

is concave on the positive-definite cone in dimension $d$.

In the scalar case, the analogous function $a \mapsto \exp(h + \log(a))$ is linear, so this result describes a
new type of phenomenon that emerges when we move to the matrix setting. Theorem 3.4.1 is
not easy to prove, so we must take it for granted.

Let  us  focus  on  the  consequences  of  this  remarkable  result.  Lieb’s  Theorem  is  valuable  to  us 
because  the  Laplace  transform  bounds  from  Section  3.2  involve  the  trace  exponential  function. 
To  highlight  the  connection,  let  us  rephrase  Theorem  3.4.1  in  probabilistic  terms. 

Corollary 3.4.2. Let $H$ be a fixed Hermitian matrix, and let $X$ be a random Hermitian matrix of
the same size. Then

$$ \mathbb{E} \operatorname{tr} \exp(H + X) \le \operatorname{tr} \exp\big(H + \log(\mathbb{E}\, e^{X})\big). $$

Proof. Introduce the random matrix $Y = e^{X}$. Then

$$ \mathbb{E} \operatorname{tr} \exp(H + X) = \mathbb{E} \operatorname{tr} \exp\big(H + \log(Y)\big) \le \operatorname{tr} \exp\big(H + \log(\mathbb{E}\, Y)\big) = \operatorname{tr} \exp\big(H + \log(\mathbb{E}\, e^{X})\big). $$

The first identity follows from the definition (2.1.8) of the matrix logarithm as the functional
inverse of the matrix exponential. Theorem 3.4.1 shows that the trace function is concave in $Y$,
so Jensen’s inequality (2.2.2) allows us to draw the expectation inside the function. □
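
As a sanity check on Corollary 3.4.2, one can evaluate both sides exactly for a random matrix that takes only two values. The sketch below is an illustration under stated assumptions, not the author's code: it uses NumPy, the helper names `herm_fn` and `random_symmetric` are hypothetical, and the two-point distribution (each value with probability 1/2) is chosen purely so that the expectation is a finite average.

```python
import numpy as np

def herm_fn(M, f):
    """Apply a scalar function f to a Hermitian matrix through its eigenvalues."""
    w, V = np.linalg.eigh(M)
    return (V * f(w)) @ V.conj().T

def random_symmetric(d, rng):
    """A random real symmetric matrix (illustrative helper)."""
    G = rng.standard_normal((d, d))
    return (G + G.T) / 2

rng = np.random.default_rng(2)
d = 4
H = random_symmetric(d, rng)
# Assumption for illustration: X takes each of these two values with probability 1/2.
X_values = [random_symmetric(d, rng), random_symmetric(d, rng)]

# Left-hand side: E tr exp(H + X), an exact average over the two outcomes.
lhs = np.mean([np.trace(herm_fn(H + X, np.exp)) for X in X_values])

# Right-hand side: tr exp(H + log(E exp(X))).
mean_exp = sum(herm_fn(X, np.exp) for X in X_values) / len(X_values)
rhs = np.trace(herm_fn(H + herm_fn(mean_exp, np.log), np.exp))

print(lhs, rhs, lhs <= rhs)
```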

3.5  Subadditivity  of  the  Matrix  Cgf 

We  are  now  prepared  to  generalize  the  addition  rule  (3.3.4)  for  scalar  cgfs  to  the  matrix  setting. 
The  following  result  is  fundamental  to  our  approach. 

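Lemma 3.5.1 (Subadditivity of Matrix Cgfs). Consider a finite sequence $\{X_k\}$ of independent, random,
Hermitian matrices of the same size. Then

$$ \mathbb{E} \operatorname{tr} \exp\Big(\sum_k \theta X_k\Big) \le \operatorname{tr} \exp\Big(\sum_k \log \mathbb{E}\, e^{\theta X_k}\Big) \quad\text{for } \theta \in \mathbb{R}. \tag{3.5.1} $$

Equivalently,

$$ \operatorname{tr} \exp\big(\Xi_{(\sum_k X_k)}(\theta)\big) \le \operatorname{tr} \exp\Big(\sum_k \Xi_{X_k}(\theta)\Big) \quad\text{for } \theta \in \mathbb{R}. \tag{3.5.2} $$

The parallel between the additivity rule (3.3.4) and the subadditivity rule (3.5.2) is striking.
With our level of preparation, it is easy to prove this result: We just apply the bound from
Corollary 3.4.2 repeatedly.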

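Proof. To simplify notation, we take $\theta = 1$. Let $\mathbb{E}_k$ denote the expectation with respect to $X_k$, the
remaining random matrices held fixed. Abbreviate

$$ \Xi_k := \log(\mathbb{E}_k\, e^{X_k}) = \log(\mathbb{E}\, e^{X_k}). $$

We may calculate that

$$ \begin{aligned}
\mathbb{E} \operatorname{tr} \exp\Big(\sum_{k=1}^{n} X_k\Big)
  &= \mathbb{E}\,\mathbb{E}_n \operatorname{tr} \exp\Big(\sum_{k=1}^{n-1} X_k + X_n\Big) \\
  &\le \mathbb{E} \operatorname{tr} \exp\Big(\sum_{k=1}^{n-1} X_k + \log(\mathbb{E}_n\, e^{X_n})\Big) \\
  &= \mathbb{E}\,\mathbb{E}_{n-1} \operatorname{tr} \exp\Big(\sum_{k=1}^{n-2} X_k + X_{n-1} + \Xi_n\Big) \\
  &\le \mathbb{E}\,\mathbb{E}_{n-2} \operatorname{tr} \exp\Big(\sum_{k=1}^{n-2} X_k + \Xi_{n-1} + \Xi_n\Big) \\
  &\;\;\vdots \\
  &\le \operatorname{tr} \exp\Big(\sum_{k=1}^{n} \Xi_k\Big).
\end{aligned} $$

We can introduce iterated expectations because of the tower property of conditional expectation.
At each step $m = 1, 2, \dots, n$, we invoke Corollary 3.4.2 with the fixed matrix $H$ equal to

$$ H_m = \sum_{k=1}^{m-1} X_k + \sum_{k=m+1}^{n} \Xi_k. $$

This argument is legitimate because $H_m$ is independent from $X_m$.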

The  equivalent  formulation  (3.5.2)  follows  from  (3.5.1)  when  we  substitute  the  definition  (3.1.1) 
of  the  matrix  cgf  and  make  some  algebraic  simplifications.  □ 
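
The subadditivity rule (3.5.1) can be verified exactly on a small example. The sketch below is illustrative rather than part of the notes: it assumes NumPy and takes each summand to be $X_k = \varepsilon_k A_k$ with a fixed symmetric matrix $A_k$ and an independent random sign $\varepsilon_k$, so that $\mathbb{E}\, e^{\theta X_k} = \cosh(\theta A_k)$ and the left-hand side is a finite average over sign patterns. The function names are hypothetical.

```python
import itertools
import numpy as np

def herm_fn(M, f):
    """Apply a scalar function f to a Hermitian matrix through its eigenvalues."""
    w, V = np.linalg.eigh(M)
    return (V * f(w)) @ V.conj().T

def lhs_expected_trace_exp(coeffs, theta=1.0):
    """E tr exp(theta * sum_k eps_k A_k), averaged exactly over all 2^n sign patterns."""
    n = len(coeffs)
    total = 0.0
    for signs in itertools.product([-1.0, 1.0], repeat=n):
        S = sum(s * A for s, A in zip(signs, coeffs))
        total += np.trace(herm_fn(theta * S, np.exp)).real
    return total / 2**n

def rhs_trace_exp_sum_cgf(coeffs, theta=1.0):
    """tr exp(sum_k log E exp(theta X_k)), using E exp(theta eps A) = cosh(theta A)."""
    cgf_sum = sum(herm_fn(theta * A, lambda w: np.log(np.cosh(w))) for A in coeffs)
    return np.trace(herm_fn(cgf_sum, np.exp)).real

rng = np.random.default_rng(3)
coeffs = []
for _ in range(5):
    G = rng.standard_normal((3, 3))
    coeffs.append((G + G.T) / 2)

lhs = lhs_expected_trace_exp(coeffs)
rhs = rhs_trace_exp_sum_cgf(coeffs)
print(lhs, rhs, lhs <= rhs)
```

When the coefficient matrices commute (for instance, when they are all diagonal), the two sides agree, which recovers the scalar addition rule (3.3.4) coordinate by coordinate.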

3.6  Master  Bounds  for  Independent  Sums  of  Matrices 

Finally,  we  can  present  some  general  results  on  the  behavior  of  a  sum  of  independent  random 
matrices.  At  this  stage,  we  simply  combine  the  Laplace  transform  bounds  with  the  subadditivity 
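of the matrix cgf to obtain abstract inequalities. Later, we will harness properties of the summands
to develop more concrete estimates that apply to specific examples of interest.

Theorem 3.6.1 (Master Bound for an Independent Sum of Matrices). Consider a finite sequence
$\{X_k\}$ of independent, random, Hermitian matrices. Then

$$ \mathbb{E}\,\lambda_{\max}\Big(\sum_k X_k\Big) \le \inf_{\theta > 0} \frac{1}{\theta} \log \operatorname{tr} \exp\Big(\sum_k \log \mathbb{E}\, e^{\theta X_k}\Big), \quad\text{and} \tag{3.6.1} $$

$$ \mathbb{E}\,\lambda_{\min}\Big(\sum_k X_k\Big) \ge \sup_{\theta < 0} \frac{1}{\theta} \log \operatorname{tr} \exp\Big(\sum_k \log \mathbb{E}\, e^{\theta X_k}\Big). \tag{3.6.2} $$

Furthermore, for all $t \in \mathbb{R}$,

$$ \mathbb{P}\Big\{\lambda_{\max}\Big(\sum_k X_k\Big) \ge t\Big\} \le \inf_{\theta > 0} e^{-\theta t} \operatorname{tr} \exp\Big(\sum_k \log \mathbb{E}\, e^{\theta X_k}\Big), \quad\text{and} \tag{3.6.3} $$

$$ \mathbb{P}\Big\{\lambda_{\min}\Big(\sum_k X_k\Big) \le t\Big\} \le \inf_{\theta < 0} e^{-\theta t} \operatorname{tr} \exp\Big(\sum_k \log \mathbb{E}\, e^{\theta X_k}\Big). \tag{3.6.4} $$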

Proof.  Substitute  the  subadditivity  rule  for  matrix  cgfs,  Lemma  3.5.1,  into  the  two  matrix  Laplace 
transform  results,  Proposition  3.2.1  and  Proposition  3.2.2.  □ 

In  this  chapter,  we  have  focused  on  probability  inequalities  for  the  extreme  eigenvalues  of  a 
sum  of  independent  random  matrices.  Nevertheless,  these  results  also  give  information  about 
the  spectral  norm  of  a  sum  of  independent,  random,  general  matrices  because  we  can  apply 
them  to  the  Hermitian  dilation  of  the  sum.  Instead  of  presenting  a  general  theorem,  we  find  it 
more natural to extend the specific tail bounds to general matrices.

3.7 Notes

This  section  includes  some  historical  discussion  about  the  results  we  have  described  in  this 
chapter,  along  with  citations  for  the  results  that  we  have  established. 

3.7.1  The  Matrix  Laplace  Transform  Method 

The idea of lifting the “Bernstein trick” to the matrix setting is due to two researchers in quantum
information theory, Rudolf Ahlswede and Andreas Winter, who were working on a problem
concerning transmission of information through a quantum channel [AW02]. Their paper contains
a version of the matrix Laplace transform result, Proposition 3.2.1, along with a substantial
number of related foundational ideas. Their work is one of the major inspirations for the tools
that are described in these notes.

The precise version of Proposition 3.2.1 and the proof that we present here are due to Roberto
Oliveira, from his elegant paper [Oli10b]. The subsequent result on expectations,
Proposition 3.2.2, first appeared in the paper [CGT12a].

3.7.2  Subadditivity  of  Cumulants 

The major impediment to applying the matrix Laplace transform method is the need to produce
a bound for the trace of the matrix moment generating function (the trace mgf). This is where
all the technical difficulty in the argument resides. Ahlswede and Winter [AW02, App.] proposed
a different approach for bounding the trace mgf of an independent sum, based on a repeated
application of the Golden-Thompson inequality (3.3.3). The Ahlswede-Winter argument leads
to a cumulant bound of the form

$$ \mathbb{E} \operatorname{tr} \exp\Big(\sum_k X_k\Big) \le d \cdot \exp\Big(\sum_k \lambda_{\max}(\log \mathbb{E}\, e^{X_k})\Big). \tag{3.7.1} $$

In other words, they bound the cumulant of a sum in terms of the sum of maximum eigenvalues
of the cumulants. There are cases where the bound (3.7.1) is equivalent to Lemma 3.5.1.
For example, the bounds coincide when each matrix $X_k$ is identically distributed. In general,
however, the estimate (3.7.1) leads to fundamentally weaker results.

The first major technical advance beyond the original argument of Ahlswede and Winter appears
in another paper [Oli10a] of Oliveira. He developed a much more effective way to deploy
the Golden-Thompson inequality, and he used this technique to establish a matrix version
of Freedman’s inequality [Fre75]. In the scalar setting, Freedman’s inequality extends the
Bernstein concentration inequality to martingales. Oliveira obtained the analogous extension of
Bernstein’s inequality for matrix-valued martingales. When specialized to independent sums, his
result is quite similar to the matrix Bernstein inequality, Theorem 6.1.1, apart from the precise
values of the constants. Oliveira’s method, however, does not seem to deliver the full spectrum
of matrix concentration inequalities that we discuss in these notes.

The approach we describe here, based on Lieb’s Theorem, was developed in the paper [Tro11d].
This research recognized the probabilistic content of Lieb’s Theorem, Corollary 3.4.2, and it used
this idea to establish Lemma 3.5.1, on the subadditivity of cumulants, along with the master tail
bounds from Theorem 3.6.1. Note that the two articles [Oli10a, Tro11d] are independent works.

For a detailed discussion of the benefits of Lieb’s Theorem over the Golden-Thompson inequality,
turn to [Tro11d, §4]. In summary, to get the sharpest concentration results for random
matrices, Lieb’s Theorem is indispensable. The Ahlswede-Winter approach seems to be intrinsically
weaker. Oliveira’s argument has certain advantages, however, in that it extends from matrices
to the fully noncommutative setting [JZ12].

Subsequent research on the underpinnings of the matrix Laplace transform method has led
to a martingale version of the subadditivity of cumulants [Tro11a, Tro11c]; these works also depend
on Lieb’s Theorem. Another paper [GT11] shows how to use a more general result, called
the Lieb-Seiringer Theorem [LS05], to obtain upper and lower tail bounds for all eigenvalues of
a sum of independent random Hermitian matrices.

3.7.3  Noncommutative  Moment  Inequalities 

There is a closely related, and much older, line of research on noncommutative moment inequalities.
These results provide information about the expected trace of a power of a sum of
independent random matrices. The matrix Laplace transform method, as encapsulated in
Theorem 3.6.1, gives analogous bounds for the exponential moments.

This research originates in an important paper [LP86] of Françoise Lust-Piquard. This article
develops an extension of the Khintchine inequality for matrices. Her result concerns a sum of
fixed matrices that are modulated by independent Gaussian random variables. It shows that the
expected trace of an even power of this random matrix is controlled by its variance. Subsequent
papers have refined the noncommutative Khintchine inequality to its optimal form [LPP91, Buc01,
Buc05].

In recent years, researchers have generalized other moment inequalities for sums of scalar
random variables to matrices (and beyond). For instance, the Rosenthal inequality, concerning
a sum of independent zero-mean random variables, admits a matrix version [JZ11, MJC+12,
CGT12a]. See the paper [JX05] for a good overview of some other noncommutative moment
inequalities.

Finally, and tangentially, we mention that matrix moments and cumulants also play a central
role in the theory of free probability [Spe11].

3.7.4  Quantum  Statistical  Mechanics 

A  curious  feature  of  the  theory  of  matrix  concentration  inequalities  is  that  the  most  powerful 
tools  come  from  the  mathematical  theory  of  quantum  statistical  mechanics.  This  field  studies 
the  bulk  statistical  properties  of  interacting  quantum  systems,  and  it  would  seem  quite  distant 
from  the  field  of  random  matrix  theory.  The  connection  between  these  two  areas  has  emerged 
because of research on quantum information theory, which studies how information can be
encoded, operated upon, and transmitted via quantum mechanical systems.

The Golden-Thompson inequality is a major result from quantum statistical mechanics. For
a detailed treatment from the perspective of matrix theory, see Bhatia’s book [Bha97, Sec. IX.3].
The fact that the Golden-Thompson inequality fails for three matrices can be obtained from simple
examples, such as combinations of Pauli spin matrices [Bha97, Exer. IX.8.4]. For an account
with more physical content, see the book of Thirring [Thi02].

Lieb’s Theorem [Lie73, Thm. 6] was first established in an important paper of Elliott Lieb on
the convexity of trace functions. His argument is difficult. Subsequent work has led to more
direct routes to the result. Epstein provides an alternative proof of Theorem 3.4.1 in [Eps73,
Sec. II], and Ruskai offers a simplified account of Epstein’s argument in [Rus02, Rus05]. The
note [Tro11b] shows how to derive Lieb’s Theorem from the joint convexity of quantum relative
entropy [Lin74, Lem. 2]. The latter approach is advantageous because the joint convexity result
admits several elegant, conceptual proofs [Pet86, Eff09].

In this chapter, we present our first set of matrix concentration inequalities. These results provide
spectral information about a sum of fixed matrices, modulated by independent scalar random
variables. This type of formulation is surprisingly versatile, and it already encompasses a
range of interesting examples.

To be more precise about our scope, let us introduce the concept of a matrix Gaussian series.
Consider a finite sequence $\{A_k\}$ of fixed Hermitian matrices with the same dimension, along
with a finite sequence $\{\gamma_k\}$ of independent standard normal random variables. We will analyze
the extreme eigenvalues of the random matrix

$$ Y = \sum_k \gamma_k A_k. $$

As  an  example,  we  can  express  a  Wigner  matrix,  one  of  the  classical  random  matrices,  in  this 
fashion.  The  real  value  of  this  perspective,  however,  is  that  we  can  use  matrix  Gaussian  series  to 
represent  many  other  kinds  of  random  matrices  formed  from  Gaussian  random  variables.  These 
models  allow  us  to  attack  problems  that  classical  methods  do  not  handle  gracefully.  For  instance, 
we  can  study  a  symmetric  Toeplitz  matrix  with  Gaussian  entries. 

We  do  not  need  to  limit  our  attention  to  the  Hermitian  case.  This  chapter  also  contains 
bounds  on  the  spectral  norm  of  a  Gaussian  series  with  general  matrix  coefficients.  Remarkably, 
these results follow as an immediate corollary of the Hermitian theory. This theory brings
rectangular matrices based on Gaussian variables within our purview.

Furthermore, similar ideas allow us to treat a matrix Rademacher series, a sum of fixed matrices
modulated by random signs. (Recall that a Rademacher random variable takes values in
$\{\pm 1\}$ with equal probability.) The results in this case are almost identical to the results for matrix
Gaussian series, but they allow us to consider new problems. For instance, we can study the
expected spectral norm of a fixed real matrix after flipping the signs of the entries at random.

We begin, in §§4.1-4.2, with an overview of our results for matrix Gaussian series; very similar
results also hold for matrix Rademacher series. Afterward, in §4.3, we discuss the accuracy of the
theoretical bounds. The subsequent sections, §§4.4-4.6, describe what the matrix concentration
inequalities tell us about some classical and not-so-classical examples of random matrices.
Section 4.7 includes an overview of a more substantial application in combinatorial optimization.
The final part of the chapter, §§4.8-4.9, contains detailed proofs of the bounds. We conclude
with bibliographical notes.

4.1 Series with Hermitian Matrices

Consider a finite sequence $\{a_k\}$ of real numbers and a finite sequence $\{\gamma_k\}$ of independent
standard normal random variables. A routine invocation of the scalar Laplace transform method
demonstrates that

$$ \mathbb{P}\Big\{\sum_k \gamma_k a_k \ge t\Big\} \le e^{-t^2/2\sigma^2} \quad\text{where}\quad \sigma^2 = \sum_k a_k^2. \tag{4.1.1} $$

This result indicates that the upper tail of a scalar Gaussian series behaves like the upper tail of a
single Gaussian random variable with variance $\sigma^2$. It turns out that the inequality (4.1.1) extends
directly to the matrix setting.
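
For readers who want the "routine invocation" spelled out, here is one way the computation can go. This is a sketch under two standard assumptions that are not restated in the text: the level satisfies $t \ge 0$, and a standard normal variable has mgf $\mathbb{E}\, e^{\theta \gamma} = e^{\theta^2/2}$.

```latex
\begin{align*}
\mathbb{P}\Big\{ \sum_k \gamma_k a_k \ge t \Big\}
  &\le \inf_{\theta > 0} e^{-\theta t}\, \mathbb{E} \exp\Big( \theta \sum_k \gamma_k a_k \Big)
     && \text{(Markov's inequality)} \\
  &= \inf_{\theta > 0} e^{-\theta t} \prod_k \mathbb{E}\, e^{\theta a_k \gamma_k}
     && \text{(independence, as in (3.3.1))} \\
  &= \inf_{\theta > 0} \exp\big( -\theta t + \theta^2 \sigma^2 / 2 \big)
     && \text{(Gaussian mgf, } \sigma^2 = \textstyle\sum_k a_k^2 \text{)} \\
  &= e^{-t^2 / 2\sigma^2}
     && \text{(the exponent is minimized at } \theta = t/\sigma^2 \text{)}.
\end{align*}
```

The second line is exactly the step that has no direct matrix analog, which is why the matrix argument needs Lemma 3.5.1 in its place.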

Theorem 4.1.1 (Matrix Gaussian and Rademacher Series: The Hermitian Case). Consider a finite
sequence $\{A_k\}$ of fixed Hermitian matrices with dimension $d$, and let $\{\gamma_k\}$ be a finite sequence of
independent standard normal variables. Form the matrix Gaussian series

$$ Y = \sum_k \gamma_k A_k. $$

Compute the variance parameter

$$ \sigma^2 = \sigma^2(Y) = \|\mathbb{E}(Y^2)\|. \tag{4.1.2} $$

Then

$$ \mathbb{E}\,\lambda_{\max}(Y) \le \sqrt{2\sigma^2 \log d}. \tag{4.1.3} $$

Furthermore, for all $t \ge 0$,

$$ \mathbb{P}\{\lambda_{\max}(Y) \ge t\} \le d\, e^{-t^2/2\sigma^2}. \tag{4.1.4} $$

The same bounds hold when we replace $\{\gamma_k\}$ by a finite sequence of independent Rademacher
random variables.

The  proof  of  this  result  appears  below  in  §4.8. 


4.1.1  Discussion 

Let us take a moment to discuss the content of Theorem 4.1.1. The main message is that the
expectation of the maximum eigenvalue of $Y$ is controlled by the matrix variance $\sigma^2$. Furthermore,
the maximum eigenvalue of $Y$ has a Gaussian tail whose decay rate depends on $\sigma^2$.

We can obtain a more explicit expression for the variance (4.1.2) in terms of the coefficients
in the Gaussian series. Simply compute that

$$ \sigma^2(Y) = \|\mathbb{E}(Y^2)\| = \Big\| \sum_{j,k} \mathbb{E}(\gamma_j \gamma_k)\, A_j A_k \Big\| = \Big\| \sum_k A_k^2 \Big\|. \tag{4.1.5} $$

The last identity follows because $\{\gamma_k\}$ is an independent family of standard normal variables, so
$\mathbb{E}(\gamma_j \gamma_k)$ equals one when $j = k$ and zero otherwise. As in the scalar case (4.1.1),
the variance is the sum of the squares of the coefficients.
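
The formula (4.1.5) makes the bounds of Theorem 4.1.1 straightforward to evaluate for any concrete list of coefficient matrices. The following is a minimal sketch assuming NumPy; the function names are hypothetical and chosen only for this illustration.

```python
import numpy as np

def variance_hermitian(coeffs):
    """sigma^2 = || sum_k A_k^2 || (spectral norm), following (4.1.5)."""
    S = sum(A @ A for A in coeffs)
    return np.linalg.norm(S, 2)

def expectation_bound(coeffs):
    """Right-hand side of (4.1.3): sqrt(2 sigma^2 log d)."""
    d = coeffs[0].shape[0]
    return np.sqrt(2.0 * variance_hermitian(coeffs) * np.log(d))

def tail_bound(coeffs, t):
    """Right-hand side of (4.1.4): d * exp(-t^2 / (2 sigma^2)), for t >= 0."""
    d = coeffs[0].shape[0]
    return d * np.exp(-t**2 / (2.0 * variance_hermitian(coeffs)))

# Example: two real Pauli-type matrices. Each squares to the identity,
# so the sum of squares is 2 * I and sigma^2 = 2.
pauli_x = np.array([[0.0, 1.0], [1.0, 0.0]])
pauli_z = np.array([[1.0, 0.0], [0.0, -1.0]])
coeffs = [pauli_x, pauli_z]

sigma2 = variance_hermitian(coeffs)
print(sigma2, expectation_bound(coeffs), tail_bound(coeffs, 3.0))
```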

A new feature of the bound (4.1.4) is the dimensional factor $d$. When $d = 1$, this factor vanishes,
and the matrix bound coincides with the scalar result (4.1.1). When $d = 1$, the expectation
bound (4.1.3) also produces a sharp result, namely $\mathbb{E}\,\lambda_{\max}(Y) \le 0$. In this case, at least, we have
lost nothing by lifting the Laplace transform method to matrices. In §4.3, we discuss the extent
to which Theorem 4.1.1 provides accurate predictions.

Finally, the reader may be concerned about the lack of explicit inequalities for the minimum
eigenvalue $\lambda_{\min}(Y)$. But these bounds are consequences of the results for the maximum eigenvalue
because $-Y$ has the same distribution as $Y$. Therefore,

$$ \mathbb{E}\,\lambda_{\min}(Y) = \mathbb{E}\,\lambda_{\min}(-Y) = -\mathbb{E}\,\lambda_{\max}(Y) \ge -\sqrt{2\sigma^2 \log d}. \tag{4.1.6} $$

The second identity holds because of the relationship (2.1.3) between minimum and maximum
eigenvalues. Similar considerations lead to a lower tail bound for the minimum eigenvalue:

$$ \mathbb{P}\{\lambda_{\min}(Y) \le -t\} \le d\, e^{-t^2/2\sigma^2} \quad\text{for } t \ge 0. \tag{4.1.7} $$

This result follows directly from the upper tail bound (4.1.4).

4.2  Series  with  General  Matrices 

Most  of  the  inequalities  in  these  notes  can  be  adapted  to  study  the  spectral  norm  of  a  sum  of 
general  random  matrices.  Although  this  problem  might  seem  to  have  a  character  different  from 
the  Hermitian  case,  the  results  for  general  matrices  are  an  easy  formal  consequence  of  the  theory 
for  Hermitian  matrices.  Here  is  the  extension  of  Theorem  4.1.1. 

Corollary 4.2.1 (Matrix Gaussian and Rademacher Series: The General Case). Consider a finite
sequence $\{B_k\}$ of fixed complex matrices with dimensions $d_1 \times d_2$, and let $\{\gamma_k\}$ be a finite sequence
of independent standard normal variables. Form the matrix Gaussian series

$$ Z = \sum_k \gamma_k B_k. $$

Compute the variance parameter

$$ \sigma^2 = \sigma^2(Z) = \max\{\|\mathbb{E}(ZZ^*)\|,\ \|\mathbb{E}(Z^*Z)\|\}. \tag{4.2.1} $$

Then

$$ \mathbb{E}\,\|Z\| \le \sqrt{2\sigma^2 \log(d_1 + d_2)}. \tag{4.2.2} $$

Furthermore, for all $t \ge 0$,

$$ \mathbb{P}\{\|Z\| \ge t\} \le (d_1 + d_2)\, e^{-t^2/2\sigma^2}. \tag{4.2.3} $$

The same bounds hold when we replace $\{\gamma_k\}$ by a finite sequence of independent Rademacher
random variables.


The proof of Corollary 4.2.1 appears below in §4.9.

4.2.1 Discussion

The results for rectangular matrices are similar to the results in Theorem 4.1.1 for Hermitian
matrices, so many of the same intuitions apply. Still, the differences deserve some comment.

The most salient change occurs in the definition (4.2.1) of the variance parameter. The variance
has this particular form because a general matrix has two squares associated with it, and
we can omit neither one. Note that, when $Z$ is Hermitian, the general variance (4.2.1) reduces to
the Hermitian variance (4.1.2), so the new definition extends the previous one.

To  represent  the  variance  in  terms  of  the  coefficient  matrices,  we  simply  calculate  that 

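$$ \begin{aligned}
\sigma^2(Z) &= \max\{\|\mathbb{E}(ZZ^*)\|,\ \|\mathbb{E}(Z^*Z)\|\} \\
  &= \max\Big\{ \Big\| \sum_{j,k} \mathbb{E}(\gamma_j \gamma_k)\, B_j B_k^* \Big\|,\ \Big\| \sum_{j,k} \mathbb{E}(\gamma_j \gamma_k)\, B_j^* B_k \Big\| \Big\} \\
  &= \max\Big\{ \Big\| \sum_k B_k B_k^* \Big\|,\ \Big\| \sum_k B_k^* B_k \Big\| \Big\}.
\end{aligned} \tag{4.2.4} $$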

The  expression  (4.2.4)  provides  a  natural  formulation  of  the  “sum  of  squares”  of  a  sequence  of 
general  matrices. 

The dimensional factor $d_1 + d_2$ in Corollary 4.2.1 apparently differs from the factor $d$ that
appears in Theorem 4.1.1. Nevertheless, properly interpreted, the two results coincide: Observe
that we must bound the maximum and minimum eigenvalues of a Hermitian Gaussian series $Y$
to control its spectral norm. Thus,

$$ \mathbb{P}\{\|Y\| \ge t\} \le 2d\, e^{-t^2/2\sigma^2}. \tag{4.2.5} $$

This inequality follows when we apply the union bound to the upper (4.1.4) and lower (4.1.7) tail
bounds. The dimensional factor $d_1 + d_2$ in Corollary 4.2.1 matches the factor $2d$ in (4.2.5). We
conclude that it is appropriate for both dimensions of the general matrix to play a role.

4.3  Are  the  Bounds  Sharp? 

One may wonder whether Theorem 4.1.1 and Corollary 4.2.1 provide accurate information about
the behavior of a matrix Gaussian series. The answer turns out to be complicated, so we must
limit ourselves to a summary of facts.

First, we consider the bound (4.2.2) for the expectation of a Gaussian series $Z$ taking $d_1 \times d_2$
matrix values:

$$ \mathbb{E}\,\|Z\| \le \sqrt{2\sigma^2 \log(d_1 + d_2)}, $$

where $\sigma^2$ is defined in (4.2.1). Since the upper tail of $\|Z\|$ decays so quickly, it is easy to believe
(and true!) that

$$ \mathbb{E}\,\|Z\|^2 \le 2\sigma^2 \log(d_1 + d_2). $$

On the other hand, since the spectral norm is convex, Jensen’s inequality (2.2.2) shows that

$$ \mathbb{E}(\|Z\|^2) = \mathbb{E} \max\{\|ZZ^*\|,\ \|Z^*Z\|\} \ge \max\{\|\mathbb{E}(ZZ^*)\|,\ \|\mathbb{E}(Z^*Z)\|\} = \sigma^2. $$

The first identity holds because $\|Z\|^2 = \|ZZ^*\| = \|Z^*Z\|$. The final relation depends on the
calculation (4.2.4). In summary,

$$ \sigma^2 \le \mathbb{E}(\|Z\|^2) \le 2\sigma^2 \log(d_1 + d_2). \tag{4.3.1} $$

We see that the matrix variance $\sigma^2$ defined in (4.2.1) is roughly the correct scale for $\mathbb{E}(\|Z\|^2)$. In
general, it is a challenging problem to identify the expected norm of a Gaussian series, so the
estimate (4.3.1) is already a significant achievement.
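
These relations can be checked exactly in the Rademacher setting, where the expectation is a finite average over sign patterns. The sketch below is illustrative and not from the notes: it assumes NumPy, uses hypothetical function names, and replaces the Gaussian coefficients by random signs (Corollary 4.2.1 states that the same bounds hold in that case). It verifies the Jensen lower bound $\sigma^2 \le \mathbb{E}\|Z\|^2$ and the expectation bound (4.2.2) for one small instance.

```python
import itertools
import numpy as np

def variance_general(coeffs):
    """sigma^2 = max{ ||sum_k B_k B_k^*||, ||sum_k B_k^* B_k|| }, as in (4.2.4)."""
    row = sum(B @ B.conj().T for B in coeffs)
    col = sum(B.conj().T @ B for B in coeffs)
    return max(np.linalg.norm(row, 2), np.linalg.norm(col, 2))

def rademacher_norm_moments(coeffs):
    """Exact E||Z|| and E||Z||^2 for Z = sum_k eps_k B_k, enumerating all 2^n signs."""
    n = len(coeffs)
    norms = []
    for signs in itertools.product([-1.0, 1.0], repeat=n):
        Z = sum(s * B for s, B in zip(signs, coeffs))
        norms.append(np.linalg.norm(Z, 2))
    norms = np.array(norms)
    return norms.mean(), (norms**2).mean()

rng = np.random.default_rng(1)
d1, d2, n = 3, 5, 8
coeffs = [rng.standard_normal((d1, d2)) for _ in range(n)]

sigma2 = variance_general(coeffs)
mean_norm, mean_sq_norm = rademacher_norm_moments(coeffs)
upper = np.sqrt(2.0 * sigma2 * np.log(d1 + d2))

print("sigma^2 <= E||Z||^2:", sigma2 <= mean_sq_norm)
print("E||Z|| <= sqrt(2 sigma^2 log(d1 + d2)):", mean_norm <= upper)
```

Exhaustive enumeration costs $2^n$ norm evaluations, so this approach is only practical for a handful of summands; it is meant as a check on the formulas, not as a computational method.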

At this point, we might ask whether either side of the inequality (4.3.1) can be tightened. The
answer is negative, unless we have additional information beyond the variance $\sigma^2$. There are
examples of matrix Gaussian series where the left-hand inequality is correct up to constant factors,
while there are other examples that saturate the right-hand inequality. Later in this chapter,
when we turn to applications, we will encounter both of these cases (and more). In Chapter 7,
we will show how to moderate the dimensional factor, but we cannot remove it entirely using
current techniques.

What about the tail bound (4.2.3) for the norm of the Gaussian series? Here, our results are
less impressive. It turns out that the large-deviation behavior of a Gaussian series is controlled
by a different parameter $\sigma_*^2$ called the weak variance. There are cases where the weak variance
$\sigma_*^2$ is substantially smaller than the variance $\sigma^2$, which means that the tail bound (4.2.3) can
badly overestimate the tail probability when the level $t$ is large. Fortunately, this problem is
less pronounced with the matrix Chernoff inequalities of Chapter 5 and the matrix Bernstein
inequalities of Chapter 6.

In  short,  the  primary  value  of  matrix  concentration  inequalities  inheres  in  the  estimates  that 
they  provide  for  the  expectation  of  the  norm  (maximum  eigenvalue,  minimum  eigenvalue)  of  a 
random  matrix.  In  many  cases,  they  also  provide  reasonable  information  about  the  tail  decay, 
but  there  are  other  situations  where  the  tail  bounds  are  depressingly  feeble. 

4.4  Example:  Some  Gaussian  Matrices 

Let us begin by applying our tools to two types of Gaussian matrices that have been studied
extensively in the classical literature on random matrix theory. In these cases, precise information
about the eigenvalue distribution is available, which provides a benchmark for assessing our
results. We find that bounds based on Theorem 4.1.1 and Corollary 4.2.1 lead to very reasonable
estimates, but they are not sharp. We can reach similar conclusions for matrices with independent
Rademacher entries.

4.4.1  Gaussian  Wigner  Matrices 

We begin with a family of Gaussian Wigner matrices. A $d \times d$ matrix $W_d$ from this ensemble
is real-symmetric with a zero diagonal; the entries above the diagonal are independent normal
variables with mean zero and variance one:

$$ W_d = \begin{bmatrix}
0 & \gamma_{12} & \gamma_{13} & \dots & \gamma_{1d} \\
\gamma_{12} & 0 & \gamma_{23} & \dots & \gamma_{2d} \\
\gamma_{13} & \gamma_{23} & 0 & & \gamma_{3d} \\
\vdots & \vdots & & \ddots & \vdots \\
\gamma_{1d} & \gamma_{2d} & \dots & \gamma_{d-1,d} & 0
\end{bmatrix} $$

where $\{\gamma_{jk} : 1 \le j < k \le d\}$ is an independent family of standard normal variables. We can
represent this matrix more compactly as a Gaussian series:

$$ W_d = \sum_{1 \le j < k \le d} \gamma_{jk} (E_{jk} + E_{kj}). \tag{4.4.1} $$

It is known that

$$ \frac{1}{\sqrt{d}}\, \lambda_{\max}(W_d) \longrightarrow 2 \quad\text{almost surely as } d \to \infty. \tag{4.4.2} $$

To make (4.4.2) precise, we assume that $\{W_d\}$ is a sequence of independent Gaussian Wigner
matrices, indexed by the dimension $d$.

Theorem 4.1.1 provides a simple way to bound the maximum eigenvalue of a Gaussian Wigner
matrix. We just need to compute the variance $\sigma^2(W_d)$. To that end, note that the sum of the
squared coefficient matrices takes the form

$$ \sum_{1 \le j < k \le d} (E_{jk} + E_{kj})^2 = \sum_{1 \le j < k \le d} (E_{jj} + E_{kk}) = (d - 1)\, I_d. $$

We have used the fact that $E_{jk} E_{kj} = E_{jj}$, while $E_{jk} E_{jk} = 0$ because the limits of the summation
ensure that $j \ne k$. We see that

$$ \sigma^2(W_d) = \|(d - 1)\, I_d\| = d - 1. $$

The bound (4.1.3) for the expectation of the maximum eigenvalue gives

$$ \mathbb{E}\,\lambda_{\max}(W_d) \le \sqrt{2(d - 1) \log d}. \tag{4.4.3} $$

In conclusion, our techniques overestimate the maximum eigenvalue of $W_d$ by a factor of
approximately $\sqrt{0.5 \log d}$. Our result (4.4.3) is not perfect, but it only takes two lines of work. In
contrast, the classical result (4.4.2) depends on a long moment calculation that involves
challenging combinatorial arguments.
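
The variance computation for the Wigner series can be confirmed directly by building the coefficient matrices $E_{jk} + E_{kj}$ and summing their squares. This is a small sketch assuming NumPy; the function name `wigner_coefficients` and the choice $d = 6$ are illustrative only.

```python
import numpy as np

def wigner_coefficients(d):
    """Coefficient matrices E_jk + E_kj for 1 <= j < k <= d, as in (4.4.1)."""
    coeffs = []
    for j in range(d):
        for k in range(j + 1, d):
            A = np.zeros((d, d))
            A[j, k] = A[k, j] = 1.0
            coeffs.append(A)
    return coeffs

d = 6
coeffs = wigner_coefficients(d)

# Sum of squared coefficients; the text computes this to be (d - 1) * I_d.
sum_sq = sum(A @ A for A in coeffs)
sigma2 = np.linalg.norm(sum_sq, 2)

# Right-hand side of (4.4.3).
bound = np.sqrt(2.0 * (d - 1) * np.log(d))
print(sigma2, bound)
```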


4.4.2  Rectangular  Gaussian  Matrices 

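Next, we consider a $d_1 \times d_2$ rectangular matrix with independent standard normal entries:

$$ G = \begin{bmatrix}
\gamma_{11} & \gamma_{12} & \gamma_{13} & \dots & \gamma_{1 d_2} \\
\gamma_{21} & \gamma_{22} & \gamma_{23} & \dots & \gamma_{2 d_2} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\gamma_{d_1 1} & \gamma_{d_1 2} & \gamma_{d_1 3} & \dots & \gamma_{d_1 d_2}
\end{bmatrix} $$

where $\{\gamma_{jk}\}$ is an independent family of standard normal variables. We can express this matrix
efficiently using a Gaussian series:

$$ G = \sum_{j=1}^{d_1} \sum_{k=1}^{d_2} \gamma_{jk} E_{jk}. $$

For this matrix, the literature contains an elegant estimate of the form

$$ \mathbb{E}\,\|G\| \le \sqrt{d_1} + \sqrt{d_2}. \tag{4.4.4} $$

The inequality (4.4.4) is saturated when $d_1$ and $d_2$ tend to infinity with the ratio $d_1 / d_2$ fixed.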

Corollary 4.2.1 yields another bound on the expected norm of the matrix $G$. In order to
compute the variance $\sigma^2(G)$, we form the sums of squared coefficients:

$$\sum_{j=1}^{d_1} \sum_{k=1}^{d_2} E_{jk} E_{jk}^* = \sum_{j=1}^{d_1} \sum_{k=1}^{d_2} E_{jj} = d_2\, I_{d_1}, \quad\text{and}$$

$$\sum_{j=1}^{d_1} \sum_{k=1}^{d_2} E_{jk}^* E_{jk} = \sum_{j=1}^{d_1} \sum_{k=1}^{d_2} E_{kk} = d_1\, I_{d_2}.$$

The matrix variance (4.2.1) is

$$\sigma^2(G) = \max\{ \| d_2\, I_{d_1} \|,\ \| d_1\, I_{d_2} \| \} = \max\{ d_1, d_2 \}.$$

We conclude that

$$\mathbb{E}\, \|G\| \le \sqrt{2 \max\{d_1, d_2\} \log(d_1 + d_2)}. \tag{4.4.5}$$

The leading term is roughly correct because

$$\frac{1}{\sqrt{2}} \left( \sqrt{d_1} + \sqrt{d_2} \right) \le \sqrt{2 \max\{d_1, d_2\}} \le \sqrt{2} \left( \sqrt{d_1} + \sqrt{d_2} \right).$$

The  logarithmic  factor  in  (4.4.5)  does  not  belong,  but  it  is  rather  small  in  comparison  with  the 
leading  terms.  Once  again,  we  have  produced  a  good  result  with  a  minimal  amount  of  effort.  In 
contrast,  the  proof  of  (4.4.4)  depends  on  a  miraculous  application  of  a  comparison  theorem  for 
Gaussian  processes. 
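To see how the two estimates compare in practice, the sketch below evaluates the right-hand sides
of (4.4.4) and (4.4.5) for a few shapes. The function names and the sample dimensions are
illustrative assumptions; only the two formulas come from the text.

```python
import math

def comparison_bound(d1, d2):
    """Right-hand side of (4.4.4): sqrt(d1) + sqrt(d2)."""
    return math.sqrt(d1) + math.sqrt(d2)

def series_bound(d1, d2):
    """Right-hand side of (4.4.5): sqrt(2 * max(d1, d2) * log(d1 + d2))."""
    return math.sqrt(2 * max(d1, d2) * math.log(d1 + d2))

# Illustrative shapes, from square to very rectangular.
shapes = [(10, 10), (100, 50), (1000, 10), (10000, 10000)]

for d1, d2 in shapes:
    sharp = comparison_bound(d1, d2)
    ours = series_bound(d1, d2)
    print(f"d1={d1:6d} d2={d2:6d}  (4.4.4)={sharp:9.2f}  (4.4.5)={ours:9.2f}  ratio={ours / sharp:5.2f}")
```

The ratio between the two bounds grows only like the square root of $\log(d_1 + d_2)$, which is the sense in which the logarithmic factor is small.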

4.5  Example:  Matrices  with  Randomly  Signed  Entries 

Next, we turn to an example that superficially appears similar to the matrix discussed in §4.4.2
but is much less understood. Consider a fixed $d_1 \times d_2$ matrix $B$ with real entries, and let $\{\varepsilon_{jk}\}$ be
an independent family of Rademacher random variables. Consider the $d_1 \times d_2$ random matrix

$$B_{\pm} = \sum_{j=1}^{d_1} \sum_{k=1}^{d_2} \varepsilon_{jk}\, b_{jk}\, E_{jk}.$$

In other words, we obtain the random matrix $B_{\pm}$ by randomly flipping the sign of each entry of
$B$. The literature contains the following bound on the expected norm of this matrix:

$$\mathbb{E}\, \|B_{\pm}\| \le \mathrm{Const} \cdot \sigma \cdot \log^{1/4}(\min\{d_1, d_2\}), \tag{4.5.1}$$

where the leading factor

$$\sigma = \max\left\{ \max\nolimits_j \|b_{j:}\|,\ \max\nolimits_k \|b_{:k}\| \right\}. \tag{4.5.2}$$

We have written $b_{j:}$ for the $j$th row of $B$ and $b_{:k}$ for the $k$th column of $B$. In other words, the
expected norm of a matrix with randomly signed entries is comparable with the maximum
Euclidean norm achieved by any row or column. There are cases where the bound (4.5.1) admits a
matching lower bound.

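Corollary 4.2.1 leads to a quick proof of a slightly weaker result. We simply need to compute
the variance $\sigma^2(B_{\pm})$. To that end, note that

$$\sum_{j=1}^{d_1} \sum_{k=1}^{d_2} (b_{jk} E_{jk})(b_{jk} E_{jk})^* = \sum_{j=1}^{d_1} \left( \sum_{k=1}^{d_2} |b_{jk}|^2 \right) E_{jj}
= \begin{bmatrix} \|b_{1:}\|^2 & & \\ & \ddots & \\ & & \|b_{d_1:}\|^2 \end{bmatrix}.$$

Similarly,

$$\sum_{j=1}^{d_1} \sum_{k=1}^{d_2} (b_{jk} E_{jk})^* (b_{jk} E_{jk}) = \sum_{k=1}^{d_2} \left( \sum_{j=1}^{d_1} |b_{jk}|^2 \right) E_{kk}
= \begin{bmatrix} \|b_{:1}\|^2 & & \\ & \ddots & \\ & & \|b_{:d_2}\|^2 \end{bmatrix}.$$

Therefore, the variance (4.2.1) is

$$\sigma^2(B_{\pm}) = \max\left\{ \left\| \sum_{j=1}^{d_1} \sum_{k=1}^{d_2} (b_{jk} E_{jk})(b_{jk} E_{jk})^* \right\|,\
\left\| \sum_{j=1}^{d_1} \sum_{k=1}^{d_2} (b_{jk} E_{jk})^* (b_{jk} E_{jk}) \right\| \right\}
= \max\left\{ \max\nolimits_j \|b_{j:}\|^2,\ \max\nolimits_k \|b_{:k}\|^2 \right\}.$$

We see that $\sigma(B_{\pm})$ coincides with $\sigma$, the leading term (4.5.2) in the established estimate (4.5.1)!
Now, Corollary 4.2.1 delivers the bound

$$\mathbb{E}\, \|B_{\pm}\| \le \sqrt{2} \cdot \sigma(B_{\pm}) \cdot \log^{1/2}(d_1 + d_2). \tag{4.5.3}$$

Observe that the estimate (4.5.3) for the norm matches the correct bound (4.5.1) up to a factor
of $\log^{1/4}(\max\{d_1, d_2\})$. Yet again, we obtain a result that is respectably close to the optimal one,
even though it is not quite sharp.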

The  main  advantage  of  using  results  like  Corollary  4.2.1  to  analyze  this  random  matrix  is 
that  we  can  obtain  a  good  result  with  a  minimal  amount  of  arithmetic.  The  analysis  that  leads 
to  (4.5.1)  involves  a  long  sequence  of  combinatorial  arguments. 
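The variance parameter here is simple enough to compute by hand, and the following sketch does
so for a small matrix. The function names and the sample matrix `B` are hypothetical and chosen
only so that the answer is easy to read off.

```python
import math

def sigma_signed(B):
    """Largest Euclidean norm among the rows and columns of B, as in (4.5.2)."""
    row_norms = [math.sqrt(sum(x * x for x in row)) for row in B]
    col_norms = [math.sqrt(sum(row[k] ** 2 for row in B)) for k in range(len(B[0]))]
    return max(max(row_norms), max(col_norms))

def signed_norm_bound(B):
    """Right-hand side of (4.5.3) for the randomly signed version of B."""
    d1, d2 = len(B), len(B[0])
    return math.sqrt(2) * sigma_signed(B) * math.sqrt(math.log(d1 + d2))

# Illustrative 2 x 3 matrix: the first row (3, 0, 4) has Euclidean norm 5.
B = [[3.0, 0.0, 4.0],
     [0.0, 1.0, 0.0]]

sigma = sigma_signed(B)      # the first row dominates every column
bound = signed_norm_bound(B)
```

No random sampling is needed: the bound depends on $B$ only through its largest row and column norms.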


4.6  Example:  Gaussian  Toeplitz  Matrices 

Matrix  concentration  inequalities  offer  very  effective  tools  for  analyzing  random  matrices  that 
involve  dependency  structures  that  are  more  complicated  than  the  classical  ensembles.  In  this 
section,  we  consider  Gaussian  Toeplitz  matrices,  which  have  applications  in  signal  processing. 

We construct an (unsymmetric) $d \times d$ Gaussian Toeplitz matrix $R_d$ by populating the first row
and first column of the matrix with independent standard normal variables; the entries along
each diagonal of the matrix take the same value:

$$R_d = \begin{bmatrix}
\gamma_0 & \gamma_1 & \dots & \dots & \gamma_{d-1} \\
\gamma_{-1} & \gamma_0 & \gamma_1 & & \vdots \\
\vdots & \gamma_{-1} & \gamma_0 & \gamma_1 & \vdots \\
\vdots & & \ddots & \ddots & \gamma_1 \\
\gamma_{-(d-1)} & \dots & \dots & \gamma_{-1} & \gamma_0
\end{bmatrix},$$

where $\{\gamma_k\}$ is a family of independent standard normal variables. As usual, we represent the
Gaussian Toeplitz matrix as a matrix Gaussian series:

$$R_d = \gamma_0 I + \sum_{k=1}^{d-1} \gamma_k S^k + \sum_{k=1}^{d-1} \gamma_{-k} (S^k)^*, \tag{4.6.1}$$

where $S$ is the shift-up operator acting on $d$-dimensional column vectors:

$$S = \begin{bmatrix}
0 & 1 & & & \\
 & 0 & 1 & & \\
 & & \ddots & \ddots & \\
 & & & 0 & 1 \\
 & & & & 0
\end{bmatrix}.$$

It follows that $S^k$ shifts a vector up by $k$ places, introducing zeros at the bottom, while $(S^k)^*$ shifts
a vector down by $k$ places, introducing zeros at the top.

We can analyze this example quickly using Corollary 4.2.1. First, note that

$$(S^k)(S^k)^* = \sum_{j=1}^{d-k} E_{jj} \quad\text{and}\quad (S^k)^* (S^k) = \sum_{j=k+1}^{d} E_{jj}.$$

To obtain the variance parameter (4.2.1), we calculate the sum of the "squares" of the coefficient
matrices that appear in (4.6.1). In this instance, the two terms in the matrix variance are the
same. We find that

$$\begin{aligned}
I^2 + \sum_{k=1}^{d-1} (S^k)(S^k)^* + \sum_{k=1}^{d-1} (S^k)^* (S^k)
&= I + \sum_{k=1}^{d-1} \left[ \sum_{j=1}^{d-k} E_{jj} + \sum_{j=k+1}^{d} E_{jj} \right] \\
&= \sum_{j=1}^{d} \left[ 1 + \sum_{k=1}^{d-j} 1 + \sum_{k=1}^{j-1} 1 \right] E_{jj} \\
&= \sum_{j=1}^{d} \big( 1 + (d-j) + (j-1) \big) E_{jj} = d\, I_d.
\end{aligned} \tag{4.6.2}$$

In the second line, we (carefully) switch the order of summation and rewrite the identity matrix
as a sum of diagonal matrix units. We reach

$$\sigma^2(R_d) = \| d\, I_d \| = d.$$

An application of Corollary 4.2.1 leads us to conclude that

$$\mathbb{E}\, \|R_d\| \le \sqrt{2 d \log(2d)}. \tag{4.6.3}$$

It turns out that the inequality (4.6.3) is correct up to the precise value of the constant, which
does not seem to be known. In other words,

$$\mathrm{const} \le \frac{\mathbb{E}\, \|R_d\|}{\sqrt{d \log d}} \le \mathrm{Const} \quad\text{as } d \to \infty.$$

Here, we take $\{R_d\}$ to be a sequence of unsymmetric Gaussian Toeplitz matrices, indexed by the
ambient dimension $d$.
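The index bookkeeping in (4.6.2) is easy to get wrong, so it may help to confirm it numerically.
The sketch below builds the shift-up operator for a small dimension, forms the sum of "squares"
of the coefficient matrices from (4.6.1), and reads off the variance. The helper names and the
choice d = 5 are illustrative assumptions.

```python
import math

def identity(d):
    return [[1 if r == c else 0 for c in range(d)] for r in range(d)]

def shift_up(d):
    """Shift-up operator S: ones on the first superdiagonal, zeros elsewhere."""
    return [[1 if c == r + 1 else 0 for c in range(d)] for r in range(d)]

def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def add(A, B):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]

def toeplitz_variance_matrix(d):
    """I^2 + sum_k S^k (S^k)^* + sum_k (S^k)^* S^k for k = 1, ..., d - 1."""
    S = shift_up(d)
    total = identity(d)
    power = identity(d)
    for _ in range(1, d):
        power = matmul(power, S)       # power now holds the next S^k
        power_t = transpose(power)     # real entries, so the adjoint is the transpose
        total = add(total, matmul(power, power_t))
        total = add(total, matmul(power_t, power))
    return total

d = 5  # illustrative dimension
variance_matrix = toeplitz_variance_matrix(d)
sigma_sq = max(variance_matrix[j][j] for j in range(d))  # norm of a diagonal matrix
bound = math.sqrt(2 * sigma_sq * math.log(2 * d))        # right-hand side of (4.6.3)
```

Every diagonal entry collects one contribution from the identity, $d - j$ from the upward shifts, and $j - 1$ from the downward shifts, which is why the result is a multiple of the identity.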


4.7  Application:  Rounding  for  the  MaxQP  Relaxation 

Our  final  application  involves  a  more  substantial  question  in  combinatorial  optimization.  One 
of  the  methods  that  has  been  proposed  for  solving  a  certain  optimization  problem  leads  to  a 
matrix  Rademacher  series,  and  the  analysis  of  this  method  requires  the  spectral  norm  bounds 
from  Corollary  4.2.1.  A  detailed  treatment  would  take  us  too  far  afield,  so  we  just  sketch  the 
context  and  indicate  how  the  random  matrix  arises. 

There are many types of optimization problems that are computationally difficult to solve
exactly. One approach to solving these problems is to enlarge the constraint set in such a way that
the problem becomes tractable, a process called "relaxation." After solving the relaxed problem,
we can "round" the solution to ensure that it falls in the constraint set for the original problem. If
we can perform the rounding step without changing the value of the objective function
substantially, then the rounded solution is also a decent solution to the original optimization problem.

One difficult class of optimization problems involves maximizing a quadratic form subject
to a set of quadratic constraints and a spectral norm constraint. This problem is referred to as
MaxQP. The desired solution $Z$ to this problem is a $d_1 \times d_2$ matrix. The solution needs to satisfy
several different requirements, but we focus on the condition that $\|Z\| \le 1$.

There is a natural relaxation of the MaxQP problem that has been studied for the last decade
or so. When we solve the relaxation, we obtain a family $\{B_k : k = 1, 2, \dots, n\}$ of $d_1 \times d_2$ matrices
that satisfy the constraints

$$\sum_{k=1}^{n} B_k B_k^* \preceq I_{d_1} \quad\text{and}\quad \sum_{k=1}^{n} B_k^* B_k \preceq I_{d_2}.$$

In  fact,  these  two  bounds  are  part  of  the  specification  of  the  relaxed  problem.  To  round  the  family 
of  matrices  back  to  a  solution  Y  of  the  original  problem,  we  form  the  random  matrix 

$$Z = \alpha \sum_{k=1}^{n} \varepsilon_k B_k,$$

where $\{\varepsilon_k : k = 1, \dots, n\}$ is a family of independent Rademacher random variables. The scaling
factor $\alpha > 0$ can be adjusted to guarantee that the norm constraint holds with high probability.
What is the expected norm of $Z$? Corollary 4.2.1 yields

$$\mathbb{E}\, \|Z\| \le \sqrt{2 \sigma^2(Z) \log(d_1 + d_2)}.$$
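Here, the variance parameter satisfies

$$\sigma^2(Z) = \alpha^2 \max\left\{ \left\| \sum_{k=1}^{n} B_k B_k^* \right\|,\ \left\| \sum_{k=1}^{n} B_k^* B_k \right\| \right\} \le \alpha^2,$$

owing to the properties of the matrices $B_1, \dots, B_n$. It follows that the scaling parameter $\alpha$ should
satisfy

$$\alpha^2 = \frac{1}{2 \log(d_1 + d_2)}$$

to ensure that $\mathbb{E}\, \|Z\| \le 1$. For this choice of $\alpha$, the rounded solution $Z$ observes the spectral norm
constraint on average.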

The important fact here is that the scaling parameter $\alpha$ is usually small as compared with
the other parameters of the problem ($d_1$, $d_2$, $n$, and so forth). Therefore, the scaling does not
have a massive effect on the value of the objective function. Ultimately, this approach leads to a
technique for solving the MaxQP problem that produces a feasible point whose objective value
is within a factor of $\sqrt{2 \log(d_1 + d_2)}$ of the maximum objective value possible.
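The choice of the scaling parameter can be made concrete with a few lines of arithmetic. The
sketch below assumes the worst case $\sigma^2(Z) = \alpha^2$ permitted by the relaxation constraints;
the function names and the sample dimensions are hypothetical.

```python
import math

def rounding_scale(d1, d2):
    """Scaling alpha with alpha^2 = 1 / (2 log(d1 + d2))."""
    return 1.0 / math.sqrt(2.0 * math.log(d1 + d2))

def expected_norm_bound(alpha, d1, d2):
    """Bound sqrt(2 sigma^2 log(d1 + d2)) with sigma^2 replaced by its upper bound alpha^2."""
    return math.sqrt(2.0 * alpha ** 2 * math.log(d1 + d2))

# Illustrative dimensions only.
d1, d2 = 200, 300
alpha = rounding_scale(d1, d2)
norm_bound = expected_norm_bound(alpha, d1, d2)  # equals 1 up to rounding error
loss_factor = 1.0 / alpha                        # sqrt(2 log(d1 + d2))
```

Because the loss factor grows only like the square root of a logarithm, even large increases in the dimensions change it modestly.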


4.8  Proof  of  Bounds  for  Hermitian  Matrix  Series 

We continue with the proof that matrix Gaussian series exhibit the behavior described in
Theorem 4.1.1. Afterward, we show how to adapt the argument to address matrix Rademacher series.

4.8.1 Hermitian Gaussian Series

Our main tool is Theorem 3.6.1, the set of master bounds for independent sums. To use this
result, we must identify the cgf of a fixed matrix modulated by a Gaussian random variable.

Lemma 4.8.1 (Gaussian × Matrix: Mgf and Cgf). Suppose that $A$ is a fixed Hermitian matrix, and
let $\gamma$ be a standard normal random variable. Then

$$\mathbb{E}\, e^{\gamma \theta A} = e^{\theta^2 A^2 / 2} \quad\text{and}\quad \log \mathbb{E}\, e^{\gamma \theta A} = \frac{\theta^2}{2} A^2 \quad\text{for } \theta \in \mathbb{R}.$$

Proof. We may assume $\theta = 1$ by absorbing $\theta$ into the matrix $A$. It is well known that the moments
of a standard normal variable satisfy

$$\mathbb{E}(\gamma^{2p+1}) = 0 \quad\text{and}\quad \mathbb{E}(\gamma^{2p}) = \frac{(2p)!}{2^p\, p!} \quad\text{for } p = 0, 1, 2, \dots.$$

The formula for the odd moments holds because a standard normal variable is symmetric. One
way to establish the formula for the even moments is to use integration by parts to obtain a
recursion for the $(2p)$th moment in terms of the $(2p-2)$th moment.

Therefore, the matrix mgf satisfies

$$\mathbb{E}\, e^{\gamma A} = I + \sum_{p=1}^{\infty} \frac{\mathbb{E}(\gamma^{2p})\, A^{2p}}{(2p)!} = I + \sum_{p=1}^{\infty} \frac{(A^2/2)^p}{p!} = e^{A^2/2}.$$

The first identity holds because the odd terms in the series vanish. To compute the cgf, we extract
the logarithm of the mgf and recall (2.1.8), which states that the matrix logarithm is the functional
inverse of the matrix exponential. □
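The two facts that drive this proof are purely combinatorial, so they can be checked with exact
rational arithmetic. The sketch below uses the recursion $\mathbb{E}(\gamma^{2p}) = (2p-1)\, \mathbb{E}(\gamma^{2p-2})$
that integration by parts produces, then compares the series coefficients term by term. The
function names and the range of $p$ are illustrative assumptions.

```python
from fractions import Fraction
from math import factorial

def gaussian_even_moment(p):
    """E(gamma^(2p)) via the recursion E(gamma^(2p)) = (2p - 1) * E(gamma^(2p - 2))."""
    moment = 1
    for q in range(1, p + 1):
        moment *= 2 * q - 1
    return moment

def closed_form_moment(p):
    """The closed form (2p)! / (2^p p!) quoted in the proof."""
    return Fraction(factorial(2 * p), 2 ** p * factorial(p))

def mgf_coefficient(p):
    """Coefficient of A^(2p) in the series for E exp(gamma A)."""
    return Fraction(gaussian_even_moment(p), factorial(2 * p))

def exponential_coefficient(p):
    """Coefficient of A^(2p) in the series for exp(A^2 / 2)."""
    return Fraction(1, 2 ** p * factorial(p))

# Compare the two series coefficient by coefficient for the first few terms.
moments_agree = all(gaussian_even_moment(p) == closed_form_moment(p) for p in range(10))
series_agree = all(mgf_coefficient(p) == exponential_coefficient(p) for p in range(10))
```

Since the two power series share every coefficient, they define the same matrix function, which is the content of the mgf identity.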


The  results  for  the  maximum  and  minimum  eigenvalues  of  a  matrix  Gaussian  series  follow 
easily. 


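Proof of Theorem 4.1.1: Gaussian Case. Consider a finite sequence $\{A_k\}$ of Hermitian matrices,
and let $\{\gamma_k\}$ be a finite sequence of independent standard normal variables. Define the matrix
Gaussian series

$$Y = \sum\nolimits_k \gamma_k A_k.$$

We begin with the upper bound (4.1.3) for $\mathbb{E}\, \lambda_{\max}(Y)$. The master expectation bound, relation (3.6.1)
from Theorem 3.6.1, implies that

$$\begin{aligned}
\mathbb{E}\, \lambda_{\max}(Y)
&\le \inf_{\theta > 0} \frac{1}{\theta} \log \operatorname{tr} \exp\left( \sum\nolimits_k \log \mathbb{E}\, e^{\gamma_k \theta A_k} \right) \\
&= \inf_{\theta > 0} \frac{1}{\theta} \log \operatorname{tr} \exp\left( \frac{\theta^2}{2} \sum\nolimits_k A_k^2 \right) \\
&\le \inf_{\theta > 0} \frac{1}{\theta} \log \left[ d\, \lambda_{\max}\left( \exp\left( \frac{\theta^2}{2} \sum\nolimits_k A_k^2 \right) \right) \right] \\
&= \inf_{\theta > 0} \frac{1}{\theta} \log \left[ d \exp\left( \frac{\theta^2}{2} \lambda_{\max}\left( \sum\nolimits_k A_k^2 \right) \right) \right] \\
&= \inf_{\theta > 0} \frac{1}{\theta} \left[ \log d + \frac{\theta^2 \sigma^2}{2} \right].
\end{aligned}$$

The second line follows when we introduce the cgf from Lemma 4.8.1. To reach the third line,
we bound the trace by the dimension times the maximum eigenvalue. The fourth line
is the Spectral Mapping Theorem, Proposition 2.1.3. Identify the variance parameter (4.1.2) in
the exponent. The infimum is attained at $\theta = \sqrt{2 \sigma^{-2} \log d}$, which leads to (4.1.3).

Next, we turn to the proof of the upper tail bound (4.1.4) for $\lambda_{\max}(Y)$. Invoke the master tail
bound, relation (3.6.3) from Theorem 3.6.1, and calculate that

$$\begin{aligned}
\mathbb{P}\{ \lambda_{\max}(Y) \ge t \}
&\le \inf_{\theta > 0} e^{-\theta t} \operatorname{tr} \exp\left( \sum\nolimits_k \log \mathbb{E}\, e^{\gamma_k \theta A_k} \right) \\
&\le \inf_{\theta > 0} e^{-\theta t} \cdot d \exp\left( \frac{\theta^2}{2} \lambda_{\max}\left( \sum\nolimits_k A_k^2 \right) \right)
= d \inf_{\theta > 0} e^{-\theta t + \theta^2 \sigma^2 / 2}.
\end{aligned}$$

The steps here are the same as in the previous calculation. The infimum is achieved at $\theta = t / \sigma^2$,
which yields (4.1.4). □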
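Both infima in this proof reduce to one-dimensional calculus problems, and a coarse grid search
gives a simple way to confirm the minimizers. In the sketch below, the parameter values and the
grid are arbitrary illustrative choices, and the closed-form values $\sqrt{2 \sigma^2 \log d}$ and
$d\, e^{-t^2 / 2\sigma^2}$ are what the two optimizations evaluate to.

```python
import math

def expectation_objective(theta, d, sigma_sq):
    """(1 / theta) * (log d + theta^2 sigma^2 / 2), the last line of the expectation bound."""
    return (math.log(d) + theta ** 2 * sigma_sq / 2) / theta

def tail_objective(theta, t, d, sigma_sq):
    """d * exp(-theta t + theta^2 sigma^2 / 2), the last expression of the tail bound."""
    return d * math.exp(-theta * t + theta ** 2 * sigma_sq / 2)

# Illustrative parameters, not taken from the text.
d, sigma_sq, t = 50, 3.0, 6.0

theta_exp = math.sqrt(2 * math.log(d) / sigma_sq)  # minimizer for the expectation bound
theta_tail = t / sigma_sq                          # minimizer for the tail bound

best_exp = expectation_objective(theta_exp, d, sigma_sq)
best_tail = tail_objective(theta_tail, t, d, sigma_sq)

# Coarse grid search over theta in (0, 10] as an independent check.
grid = [0.01 * i for i in range(1, 1001)]
grid_exp = min(expectation_objective(th, d, sigma_sq) for th in grid)
grid_tail = min(tail_objective(th, t, d, sigma_sq) for th in grid)
```

The grid minima never fall below the closed-form values, which is consistent with the stated minimizers.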


4.8.2  Hermitian  Rademacher  Series 

The results for matrix Rademacher series involve arguments closely related to the proofs for
matrix Gaussian series, but we require one additional piece of reasoning to obtain the simplest
results. First, let us compute bounds for the matrix mgf and cgf of a Hermitian matrix modulated
by a Rademacher random variable.


Lemma 4.8.2 (Rademacher × Matrix: Mgf and Cgf). Suppose that $A$ is a fixed Hermitian matrix,
and let $\varepsilon$ be a Rademacher random variable. Then

$$\mathbb{E}\, e^{\varepsilon \theta A} \preceq e^{\theta^2 A^2 / 2} \quad\text{and}\quad \log \mathbb{E}\, e^{\varepsilon \theta A} \preceq \frac{\theta^2}{2} A^2 \quad\text{for } \theta \in \mathbb{R}.$$
Proof. First, we establish a scalar inequality. Comparing Taylor series,

$$\cosh(a) = \sum_{p=0}^{\infty} \frac{a^{2p}}{(2p)!} \le \sum_{p=0}^{\infty} \frac{a^{2p}}{2^p\, p!} = e^{a^2/2} \quad\text{for } a \in \mathbb{R}. \tag{4.8.1}$$

The inequality holds because $(2p)! \ge (2p)(2p-2) \cdots (4)(2) = 2^p\, p!$.

To compute the matrix mgf, we may assume $\theta = 1$. By direct calculation,

$$\mathbb{E}\, e^{\varepsilon A} = \frac{1}{2} e^{A} + \frac{1}{2} e^{-A} = \cosh(A) \preceq e^{A^2/2}.$$

The semidefinite bound follows when we apply the Transfer Rule (2.1.6) to the inequality (4.8.1).
To determine the matrix cgf, observe that

$$\log \mathbb{E}\, e^{\varepsilon A} = \log \cosh(A) \preceq \frac{1}{2} A^2.$$

The semidefinite bound follows when we apply the Transfer Rule (2.1.6) to the bound $\log \cosh(a) \le a^2/2$
for $a \in \mathbb{R}$, which is a consequence of (4.8.1). □
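The scalar inequality (4.8.1) and the factorial comparison behind it are easy to probe
numerically. The sketch below is only a spot check on a finite grid, not a proof; the grid and
the function names are illustrative assumptions.

```python
import math

def scalar_gap(a):
    """exp(a^2 / 2) - cosh(a); the inequality (4.8.1) says this is nonnegative."""
    return math.exp(a * a / 2) - math.cosh(a)

def log_cosh_gap(a):
    """a^2 / 2 - log cosh(a); nonnegative as a consequence of (4.8.1)."""
    return a * a / 2 - math.log(math.cosh(a))

def factorial_comparison(p):
    """True when (2p)! >= 2^p * p!, the termwise comparison of the two Taylor series."""
    return math.factorial(2 * p) >= 2 ** p * math.factorial(p)

# Grid of points in [-5, 5] with spacing 0.01.
points = [-5 + 0.01 * i for i in range(1001)]

smallest_gap = min(scalar_gap(a) for a in points)
smallest_log_gap = min(log_cosh_gap(a) for a in points)
factorials_ok = all(factorial_comparison(p) for p in range(30))
```

Both gaps vanish at $a = 0$ and are positive elsewhere on the grid, matching the fact that the two Taylor series agree in their first two terms.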


We  are  prepared  to  develop  probability  inequalities  for  the  extreme  eigenvalues  of  a  Rademacher 
series  with  matrix  coefficients. 


Proof of Theorem 4.1.1: Rademacher Case. Consider a finite sequence $\{A_k\}$ of Hermitian matrices,
and let $\{\varepsilon_k\}$ be a finite sequence of independent Rademacher random variables. Define the
matrix Rademacher series

$$Y = \sum\nolimits_k \varepsilon_k A_k.$$

The bounds for the extreme eigenvalues of $Y$ follow from an argument almost identical with the
proof in the Gaussian case. The only point that requires justification is the inequality

$$\operatorname{tr} \exp\left( \sum\nolimits_k \log \mathbb{E}\, e^{\varepsilon_k \theta A_k} \right) \le \operatorname{tr} \exp\left( \frac{\theta^2}{2} \sum\nolimits_k A_k^2 \right).$$

To obtain this result, we introduce the semidefinite bound, Lemma 4.8.2, for the Rademacher
cgf into the trace exponential. The left-hand side increases after this substitution because of
the fact (2.1.7) that the trace exponential function is monotone with respect to the semidefinite
order. □


4.9  Proof  of  Bounds  for  Rectangular  Matrix  Series 


Next, we consider a series with rectangular matrix coefficients modulated by independent
Gaussian or Rademacher random variables. The bounds for the norm of a rectangular series follow
instantly from the bounds for the norm of a Hermitian series because of a formal device:
We simply apply the Hermitian results to the Hermitian dilation (2.1.11) of the series.


Proof of Corollary 4.2.1. Consider a finite sequence $\{B_k\}$ of $d_1 \times d_2$ complex matrices, and let $\{\xi_k\}$
be a finite sequence of independent random variables, either standard normal or Rademacher.
Recall from Definition 2.1.5 that the Hermitian dilation is the map

$$\mathscr{H} : B \longmapsto \begin{bmatrix} 0 & B \\ B^* & 0 \end{bmatrix}.$$

This leads us to form the two series

$$Z = \sum\nolimits_k \xi_k B_k \quad\text{and}\quad Y = \mathscr{H}(Z).$$

To analyze $\|Z\|$, we wish to invoke Theorem 4.1.1. To make this step, we apply the fact (2.1.13)
that the Hermitian dilation preserves spectral information:

$$\|Z\| = \lambda_{\max}(\mathscr{H}(Z)) = \lambda_{\max}(Y).$$


Therefore, bounds on $\lambda_{\max}(Y)$ deliver bounds on $\|Z\|$. To use these results, we must express the
variance (4.1.2) of the random Hermitian matrix $Y$ in terms of the general matrix $Z$. Observe
that

$$\begin{aligned}
\sigma^2(Y) = \| \mathbb{E}(Y^2) \| = \| \mathbb{E}( \mathscr{H}(Z)^2 ) \|
&= \left\| \mathbb{E} \begin{bmatrix} Z Z^* & 0 \\ 0 & Z^* Z \end{bmatrix} \right\|
= \left\| \begin{bmatrix} \mathbb{E}(Z Z^*) & 0 \\ 0 & \mathbb{E}(Z^* Z) \end{bmatrix} \right\| \\
&= \max\{ \| \mathbb{E}(Z Z^*) \|,\ \| \mathbb{E}(Z^* Z) \| \} = \sigma^2(Z).
\end{aligned}$$

The third relation is the identity (2.1.12) for the square of the Hermitian dilation. The
penultimate equation holds because the norm of a block-diagonal matrix is the maximum norm of any
diagonal block. We obtain the formula (4.2.1) for the variance of the matrix $Z$.

We are, finally, prepared to apply Theorem 4.1.1, whose conclusions lead to the statement of
Corollary 4.2.1. □
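The algebraic step behind the variance calculation, namely that the square of the Hermitian
dilation is block diagonal, can be confirmed on a small example. The sketch below restricts to a
real matrix so that the adjoint is the ordinary transpose; the helper names and the sample
matrix `Z` are hypothetical.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def hermitian_dilation(B):
    """Dilation [[0, B], [B^T, 0]] of a real d1 x d2 matrix B."""
    d1, d2 = len(B), len(B[0])
    top = [[0] * d1 + list(row) for row in B]
    bottom = [list(row) + [0] * d2 for row in transpose(B)]
    return top + bottom

def block_diag(P, Q):
    """Block-diagonal matrix with square blocks P and Q."""
    p, q = len(P), len(Q)
    top = [list(row) + [0] * q for row in P]
    bottom = [[0] * p + list(row) for row in Q]
    return top + bottom

# Illustrative 2 x 3 real matrix.
Z = [[1, 2, 0],
     [0, -1, 3]]
Zt = transpose(Z)

H = hermitian_dilation(Z)
dilation_squared = matmul(H, H)
expected = block_diag(matmul(Z, Zt), matmul(Zt, Z))  # diag(Z Z^T, Z^T Z)
```

The off-diagonal blocks of the square vanish identically, which is why the norm splits into a maximum over the two diagonal blocks.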
4.10 Notes

The material in this chapter is, perhaps, more firmly established than anything else in these
lecture notes. We give an overview of research related to matrix Gaussian series, along with
references for the specific random matrices that we have analyzed.

4.10.1  Matrix  Gaussian  and  Rademacher  Series 

The main results, Theorem 4.1.1 and Corollary 4.2.1, have an interesting history. In the precise
form stated here, these two results first appeared in [Tro11d], but we can trace them back more
than two decades.

In his work [Oli10b, Thm. 1], Oliveira established the mgf bounds, Lemma 4.8.1 and Lemma 4.8.2.
He also developed an ingenious improvement on the arguments of Ahlswede and Winter [AW02,
App.] that gives a bound similar to Theorem 4.1.1. The constants in Oliveira's result are a bit
worse, but the dependence on the dimension is sometimes better. We do not believe that the
original approach of Ahlswede-Winter can deliver any of these results.

It turns out that Theorem 4.1.1 is roughly comparable with the noncommutative Khintchine
inequality [LP86]. The noncommutative Khintchine inequality provides a bound for the
expected trace of an even power of a matrix Gaussian series (or a matrix Rademacher series) in
terms of the variance of the series. The sharpest forms [LPP91, Buc01, Buc05] are slightly more
powerful than Theorem 4.1.1. Unfortunately, established proofs of the noncommutative
Khintchine inequality are abstract or difficult or both. Recently, the paper [MJC+12] propounded an
elementary proof, based on Stein's method of exchangeable pairs [Ste72, Cha07].

For a detailed exploration of the relationships between matrix concentration inequalities and
noncommutative moment inequalities, see [Tro11d, Sec. 4]. This discussion also indicates the
extent to which Theorem 4.1.1 and its relatives are sharp.

Recently,  there  have  been  some  minor  improvements  to  the  dimensional  factor  that  appears 
in  Theorem  4.1.1.  We  discuss  these  results  and  give  citations  in  Chapter  7. 

4.10.2  Application  to  Random  Matrices 

It  has  also  been  known  for  a  long  time  that  results  such  as  Theorem  4.1.1  can  be  used  to  study 
random  matrices. 

We believe that the functional analysis literature contains the earliest applications of
matrix concentration results to analyze random matrices. In a well-known paper [Rud99], Mark
Rudelson, acting on a suggestion of Gilles Pisier, showed how to use the noncommutative
Khintchine inequality to study a problem connected with covariance estimation. This work led
to a significant amount of activity, in which researchers used variants of Rudelson's argument to
prove other types of results. See, for example, the paper [RV07]. This approach is very powerful,
but it tends to require some effort to use.

In  parallel,  other  researchers  in  noncommutative  probability  theory  also  came  to  recognize 
the  power  of  noncommutative  moment  inequalities  in  random  matrix  theory.  See  the  paper  [JX08] 
for  a  specific  example.  Unfortunately,  this  literature  is  technically  formidable,  which  makes  it 
difficult  for  outsiders  to  appreciate  its  achievements. 

The work [AW02] of Ahlswede and Winter led to the first "finished" matrix concentration
inequalities, of the type that we describe in these lecture notes. For the first few years after this work,
most of the applications concerned quantum information theory and random graph theory. The
paper [Gro11] introduced the Ahlswede-Winter method to researchers in mathematical signal
processing and statistics, and it served to popularize matrix concentration bounds.

At this point, the available matrix concentration inequalities were still significantly
suboptimal. The main advances, in [Oli10a, Tro11d], led to nearly optimal matrix concentration
results of the kind that we present in these lecture notes. These results allow researchers to obtain
reasonably accurate analyses of a wide variety of random matrices with very little effort. New
applications of these ideas now appear on a weekly basis.

4.10.3  Wigner  and  Marcenko-Pastur 

Wigner matrices first emerged in the literature on nuclear physics, where they were used to
model the Hamiltonians of heavy atoms [Meh04]. Wigner showed that the limiting spectral
distribution of a Wigner matrix follows the semicircle law; see [Tao12, §2.4] for an overview of the
proof. The Bai-Yin law [BY93] states that, up to scaling, the maximum eigenvalue of a Wigner
matrix converges almost surely to two. See [Tao12, §2.3] for a detailed treatment. The analysis
that we present here, using Theorem 4.1.1, is drawn from [Tro11d, §4].

The first analysis of a rectangular Gaussian matrix is due to Marcenko and Pastur [MP67],
who established that the limiting distribution of the squared singular values follows a
distribution that now bears their names. The Bai-Yin law [BY93] gives an almost sure limit for the largest
singular value of a rectangular Gaussian matrix. The expectation bound (4.4.4) appears in a survey
article [DS02] by Davidson and Szarek. The expectation bound is ultimately derived from a comparison
theorem for Gaussian processes due to Fernique and amplified by Gordon [Gor85]. Our approach,
using Corollary 4.2.1, is based on [Tro11d, §4].

4.10.4  Randomly  Signed  Matrices 

Matrices with randomly signed entries have not received much attention in the literature. The
result (4.5.1) is due to Yoav Seginer [Seg00]. There is also a well-known paper [Lat05] by Rafal
Latala that provides a bound for the expected norm of a Gaussian matrix whose entries have
nonuniform variance. The analysis here, using Corollary 4.2.1, appears in [Tro11d, §4].

4.10.5  Gaussian  Toeplitz  Matrices 

Research on random Toeplitz matrices is quite recent, but there are now a number of papers
available. Bryc, Dembo, and Jiang obtained the limiting spectral distribution of a symmetric
Toeplitz matrix based on iid random variables [BDJ06]. Later, Mark Meckes established the first
bound for the expected norm of a random Toeplitz matrix based on iid random variables [Mec07].
More recently, Sen and Virag computed the limiting value of the expected norm of a random,
symmetric Toeplitz matrix whose entries have identical second-order statistics [SV11]. See the
latter paper for additional references. The analysis here, based on Corollary 4.2.1, is new.

4.10.6  Relaxation  and  Rounding  of  MaxQP 

The idea of using semidefinite relaxation and rounding to solve the MaxQP problem is due to
Arkadi Nemirovski [Nem07]. He obtained nontrivial results on the performance of his method
using some matrix moment calculations, but he was unable to reach the sharpest possible bound.
Anthony So [So09] pointed out that matrix moment inequalities could be used to obtain an
optimal result; he also showed that matrix concentration inequalities have applications to robust
optimization. The presentation here, using Corollary 4.2.1, is essentially equivalent to the
approach in [So09], but we have achieved slightly better bounds for the constants.


A  Sum  of  Random 
Positive-Semidefinite  Matrices 


This chapter presents matrix concentration inequalities that are analogous to the classical
Chernoff bounds. In the matrix setting, Chernoff-type inequalities allow us to study the extreme
eigenvalues of an independent sum of random, positive-semidefinite matrices. This approach
is valuable for controlling the norm of a random matrix and for understanding when a random
matrix is singular.

More formally, we consider independent random matrices $X_1, \dots, X_n$ with the properties

$$X_k \succeq 0 \quad\text{and}\quad \lambda_{\max}(X_k) \le R \quad\text{for each } k = 1, \dots, n.$$

Form the sum $Y = \sum_k X_k$. Our goal is to study the expectation and tail behavior of $\lambda_{\max}(Y)$ and
$\lambda_{\min}(Y)$. Matrix Chernoff inequalities offer all of these estimates. Note that it is better to use
the matrix Bernstein inequalities, from Chapter 6, to study how much a random matrix deviates
from its mean.

Bounds on the maximum eigenvalue $\lambda_{\max}(Y)$ give us information about the norm of the
matrix $Y$, a measure of how much the matrix can dilate a vector. Bounds for the minimum
eigenvalue $\lambda_{\min}(Y)$ tell us when the matrix $Y$ is nonsingular; they also provide evidence about the
norm of the inverse $Y^{-1}$, when it exists.

The  matrix  Chernoff  inequalities  are  quite  powerful,  and  they  have  numerous  applications. 
We  demonstrate  the  relevance  of  this  theory  by  considering  two  examples.  First,  we  show  how 
to  study  the  norm  of  a  random  submatrix  drawn  from  a  fixed  matrix,  and  we  explain  how  to 
check  when  the  random  submatrix  has  full  rank.  Second,  we  develop  an  analysis  to  determine 
when  a  random  graph  is  likely  to  be  connected.  These  two  problems  are  closely  related  to  basic 
questions  in  statistics  and  in  combinatorics. 

Section 5.1 presents the main results on the expectations and the tails of the extreme
eigenvalues of a sum of independent, positive-semidefinite random matrices. Section 1.6.3 describes
the application to sample covariance estimation, while §5.2 explains how the matrix Chernoff
bounds provide spectral information about a random submatrix drawn from a fixed matrix.
Afterward, in §5.4 we explain how to prove the main results.

5.1 The Matrix Chernoff Inequalities

In  the  scalar  setting,  the  Chernoff  inequalities  describe  the  behavior  of  a  sum  of  independent, 
positive  random  variables  that  are  subject  to  a  uniform  upper  bound.  These  results  are  often 
applied  to  study  the  number  Y  of  successes  in  a  sequence  of  independent — but  not  identical — 
Bernoulli  trials  with  relatively  small  probabilities  of  success.  In  this  case,  the  Chernoff  bounds 
show  that  Y  behaves  like  a  Poisson  random  variable.  The  random  variable  Y  concentrates  near 
the  expected  number  of  successes.  Its  lower  tail  has  Gaussian  decay  below  the  mean,  while  its 
upper  tail  drops  off  faster  than  an  exponential  random  variable. 

In the matrix setting, we encounter similar phenomena when we consider a sum of
independent, positive-semidefinite random matrices whose eigenvalues meet a uniform upper bound.
This behavior emerges from the next theorem, which closely parallels the scalar Chernoff
theorem.


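Theorem 5.1.1 (Matrix Chernoff). Consider a finite sequence $\{X_k\}$ of independent, random,
Hermitian matrices that satisfy

$$X_k \succeq 0 \quad\text{and}\quad \lambda_{\max}(X_k) \le R.$$

Define the random matrix

$$Y = \sum\nolimits_k X_k.$$

Compute the expectation parameters:

$$\mu_{\max} = \mu_{\max}(Y) = \lambda_{\max}(\mathbb{E}\, Y) \quad\text{and}\quad \mu_{\min} = \mu_{\min}(Y) = \lambda_{\min}(\mathbb{E}\, Y). \tag{5.1.1}$$

Then, for $\theta > 0$,

$$\mathbb{E}\, \lambda_{\max}(Y) \le \frac{e^{\theta} - 1}{\theta}\, \mu_{\max} + \frac{1}{\theta}\, R \log d, \quad\text{and} \tag{5.1.2}$$

$$\mathbb{E}\, \lambda_{\min}(Y) \ge \frac{1 - e^{-\theta}}{\theta}\, \mu_{\min} - \frac{1}{\theta}\, R \log d. \tag{5.1.3}$$

Furthermore,

$$\mathbb{P}\{ \lambda_{\max}(Y) \ge (1 + \delta) \mu_{\max} \} \le d \left[ \frac{e^{\delta}}{(1 + \delta)^{1 + \delta}} \right]^{\mu_{\max}/R} \quad\text{for } \delta \ge 0, \text{ and} \tag{5.1.4}$$

$$\mathbb{P}\{ \lambda_{\min}(Y) \le (1 - \delta) \mu_{\min} \} \le d \left[ \frac{e^{-\delta}}{(1 - \delta)^{1 - \delta}} \right]^{\mu_{\min}/R} \quad\text{for } \delta \in [0, 1). \tag{5.1.5}$$

The proof of Theorem 5.1.1 appears below in §5.4.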

5.1.1  Discussion 

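First, observe that we can easily compute the matrix expectation parameters $\mu_{\max}$ and $\mu_{\min}$ in
terms of the coefficient matrices:

$$\mu_{\max}(Y) = \lambda_{\max}\left( \sum\nolimits_k \mathbb{E}\, X_k \right) \quad\text{and}\quad \mu_{\min}(Y) = \lambda_{\min}\left( \sum\nolimits_k \mathbb{E}\, X_k \right).$$

This point follows from the linearity of expectation.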


In many situations, it is easier to work with streamlined versions of the bounds from
Theorem 5.1.1:

$$\mathbb{E}\, \lambda_{\max}(Y) \le (e - 1)\, \mu_{\max} + R \log d, \quad\text{and} \tag{5.1.6}$$

$$\mathbb{E}\, \lambda_{\min}(Y) \ge (1 - e^{-1})\, \mu_{\min} - R \log d. \tag{5.1.7}$$

We obtain these results by selecting $\theta = 1$ in both (5.1.2) and (5.1.3). Note that, in the scalar case
$d = 1$, we can take $\theta \to 0$ to obtain a numerical constant of one in each bound.

These simplifications also help to clarify the meaning of Theorem 5.1.1. On average, $\lambda_{\max}(Y)$
is not much larger than the maximum eigenvalue $\mu_{\max}$ of the mean $\mathbb{E}\, Y$ plus a fluctuation term
that reflects the maximum size $R$ of a summand and the ambient dimension $d$. Similarly, the
average value of $\lambda_{\min}(Y)$ is close to the minimum eigenvalue $\mu_{\min}$ of the mean $\mathbb{E}\, Y$, minus a
similar fluctuation term.

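We can weaken the tail bounds to reach

$$\mathbb{P}\{ \lambda_{\max}(Y) \ge t \mu_{\max} \} \le d \left( \frac{e}{t} \right)^{t \mu_{\max}/R} \quad\text{for } t \ge e, \text{ and}$$

$$\mathbb{P}\{ \lambda_{\min}(Y) \le t \mu_{\min} \} \le d\, e^{-(1 - t)^2 \mu_{\min} / 2R} \quad\text{for } t \in [0, 1).$$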

The  first  bound  manifests  that  the  upper  tail  of  Amax(F)  decays  faster  than  an  exponential  ran¬ 
dom  variable  with  mean  RmaxIR .  The  second  bound  shows  that  the  lower  tail  of  Am;n(F)  decays 
as  fast  as  a  Gaussian  random  variable  with  variance  if  /pmin-  This  is  the  same  type  of  prediction 
we  receive  from  the  scalar  Chernoff  inequalities. 
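To get a feel for the sizes involved, the following minimal sketch evaluates the expectation bounds and the tail bounds numerically. It assumes the forms of (5.1.2)-(5.1.5) that the proof in §5.4 produces; the function names and argument names are illustrative choices, not notation from these notes.

```python
import math


def chernoff_expectation_bounds(mu_max, mu_min, R, d, theta=1.0):
    """Upper bound (5.1.2) and lower bound (5.1.3) for a chosen theta > 0.

    With theta = 1 these reduce to the streamlined bounds (5.1.6)-(5.1.7).
    """
    upper = (math.exp(theta) - 1.0) / theta * mu_max + R * math.log(d) / theta
    lower = (1.0 - math.exp(-theta)) / theta * mu_min - R * math.log(d) / theta
    return upper, lower


def chernoff_upper_tail(delta, mu_max, R, d):
    """Bound (5.1.4): d * [e^delta / (1+delta)^(1+delta)]^(mu_max/R), delta >= 0."""
    log_bracket = delta - (1.0 + delta) * math.log1p(delta)
    return d * math.exp(log_bracket * mu_max / R)


def chernoff_lower_tail(delta, mu_min, R, d):
    """Bound (5.1.5): d * [e^-delta / (1-delta)^(1-delta)]^(mu_min/R), 0 <= delta < 1."""
    log_bracket = -delta - (1.0 - delta) * math.log1p(-delta)
    return d * math.exp(log_bracket * mu_min / R)
```

Working with the logarithm of the bracket avoids overflow when $\mu_{\max}/R$ is large. Comparing `chernoff_upper_tail(t - 1, ...)` with `d * (e / t) ** (t * mu_max / R)` for some $t \ge e$ shows how much the weakened bounds give away.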

5.2  Example:  A  Random  Submatrix  of  a  Fixed  Matrix 

The matrix Chernoff inequality plays an important role in bounding the extreme singular values
of a random submatrix drawn from a fixed matrix. Although Theorem 5.1.1 might not seem
suitable for this purpose (since it deals with eigenvalues), we can connect the problem with the
method via a simple transformation. The results in this section have found applications in
randomized linear algebra, sparse approximation, and other fields.


5.2.1  A  Random  Column  Submatrix 

Let $B$ be a fixed $d \times n$ matrix, and let $b_{:k}$ denote the $k$th column of this matrix. The matrix can be
expressed as a sum of columns:

    B = \sum_{k=1}^{n} b_{:k} e_k^*.

The symbol $e_k$ refers to the elementary column vector with a one in the $k$th component and
zeros elsewhere; the length of the vector is determined by context.

We consider a simple model for a random column submatrix. Let $\{\eta_k\}$ be an independent
sequence of Bernoulli random variables with common mean $q/n$. Define the random matrix

    Z = \sum_{k=1}^{n} \eta_k b_{:k} e_k^*.

That is, we include each column independently with probability $q/n$, which means that there
are typically about $q$ nonzero columns in the matrix. We do not remove the other columns; we
just zero them out.

In this section, we will obtain bounds on the expectation of the extreme singular values of
the $d \times n$ matrix $Z$. In particular,

    \mathbb{E}(\sigma_1(Z)^2) \le 1.72 \cdot \frac{q}{n} \cdot \sigma_1(B)^2 + (\log d) \cdot \max_k \|b_{:k}\|^2, \quad\text{and}

    \mathbb{E}(\sigma_d(Z)^2) \ge 0.63 \cdot \frac{q}{n} \cdot \sigma_d(B)^2 - (\log d) \cdot \max_k \|b_{:k}\|^2.    (5.2.1)

That is, the random submatrix $Z$ gets its fair share of the spectrum of the original matrix $B$. There
is a fluctuation term that depends on the largest norm of a column of $B$ and the logarithm of the
number $d$ of rows in $B$. This result is very useful because a positive bound on $\sigma_d(Z)$ ensures that
the nonzero columns of the random submatrix $Z$ are linearly independent, at least on average.
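The bound (5.2.1) is easy to probe by simulation. The sketch below, written with NumPy, draws random column submatrices under the Bernoulli model and compares the empirical means of the extreme squared singular values with the two sides of (5.2.1). The function name, the trial count, and the requirement $d \le n$ (so that `svd` returns exactly $d$ singular values) are assumptions made for this illustration.

```python
import numpy as np


def column_submatrix_experiment(B, q, trials=200, seed=0):
    """Monte Carlo estimate of E sigma_1(Z)^2 and E sigma_d(Z)^2, plus the bounds (5.2.1).

    Assumes B is d x n with d <= n, so the last singular value returned is sigma_d.
    """
    rng = np.random.default_rng(seed)
    d, n = B.shape
    top, bottom = [], []
    for _ in range(trials):
        eta = rng.random(n) < q / n            # Bernoulli(q/n) selectors, one per column
        Z = B * eta                            # zero out the unselected columns
        s = np.linalg.svd(Z, compute_uv=False) # singular values in decreasing order
        top.append(s[0] ** 2)
        bottom.append(s[-1] ** 2)
    sB = np.linalg.svd(B, compute_uv=False)
    fluctuation = np.log(d) * np.max(np.sum(np.abs(B) ** 2, axis=0))
    upper = 1.72 * (q / n) * sB[0] ** 2 + fluctuation
    lower = 0.63 * (q / n) * sB[-1] ** 2 - fluctuation
    return np.mean(top), np.mean(bottom), upper, lower
```

For matrices whose columns have comparable norms, the fluctuation term often dominates the lower bound, which then becomes negative and carries no information; the bound is most useful when $q \sigma_d(B)^2 / n$ is large compared with $(\log d) \max_k \|b_{:k}\|^2$.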


The  Analysis 

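To study the singular values of $Z$, it is convenient to define a $d \times d$ random, positive-semidefinite
matrix

    Y = ZZ^* = \sum_{j,k=1}^{n} \eta_j \eta_k (b_{:j} e_j^*)(e_k b_{:k}^*) = \sum_{k=1}^{n} \eta_k b_{:k} b_{:k}^*.

The cross terms vanish because $e_j^* e_k = 0$ unless $j = k$. Note that $\eta_k^2 = \eta_k$ because $\eta_k$ only
takes the values zero and one. The eigenvalues of $Y$ determine the singular values of $Z$, and vice
versa. In particular,

    \lambda_{\max}(Y) = \lambda_{\max}(ZZ^*) = \sigma_1(Z)^2 \quad\text{and}\quad \lambda_{\min}(Y) = \lambda_{\min}(ZZ^*) = \sigma_d(Z)^2,

where we arrange the singular values of $Z$ in weakly decreasing order $\sigma_1 \ge \dots \ge \sigma_d$.

The matrix Chernoff inequality provides bounds for the expectations of the eigenvalues of $Y$.
To apply the result, we first calculate

    \mathbb{E} Y = \sum_{k=1}^{n} (\mathbb{E} \eta_k)\, b_{:k} b_{:k}^* = \frac{q}{n} \sum_{k=1}^{n} b_{:k} b_{:k}^* = \frac{q}{n} BB^*,

so that

    \mu_{\max} = \frac{q}{n}\,\sigma_1(B)^2 \quad\text{and}\quad \mu_{\min} = \frac{q}{n}\,\sigma_d(B)^2.

Define $R = \max_k \|b_{:k}\|^2$, and observe that $\|\eta_k b_{:k} b_{:k}^*\| \le R$ for each $k$. Theorem 5.1.1 now
ensures that

    \mathbb{E}(\sigma_1(Z)^2) = \mathbb{E}\,\lambda_{\max}(Y) \le \frac{(e - 1)\,q}{n}\,\sigma_1(B)^2 + R \log d, \quad\text{and}

    \mathbb{E}(\sigma_d(Z)^2) = \mathbb{E}\,\lambda_{\min}(Y) \ge \frac{(1 - e^{-1})\,q}{n}\,\sigma_d(B)^2 - R \log d.

We have taken $\theta = 1$ in the upper (5.1.2) and lower (5.1.3) bounds for the expectation. To obtain
the stated result (5.2.1), we simply introduce numerical estimates for the constants.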


5.2.2 A Random Row and Column Submatrix

Next, we consider a model for a random set of rows and columns drawn from a fixed $d \times n$ matrix
$B$. In this case, it is helpful to use matrix notation to represent the extraction of a submatrix. Let

    P = \mathrm{diag}(\eta_1, \dots, \eta_d) \quad\text{and}\quad Q = \mathrm{diag}(\xi_1, \dots, \xi_n),

where $\{\eta_j\}$ is an independent family of Bernoulli random variables with common mean $p/d$ and
$\{\xi_k\}$ is an independent family of Bernoulli random variables with common mean $q/n$. Then

    Z = PBQ

is a random submatrix of $B$ with about $p$ nonzero rows and $q$ nonzero columns.

In this section, we will show that

    \mathbb{E}(\|Z\|^2) \le 3\,\frac{p}{d}\,\frac{q}{n}\,\|B\|^2 + 2\,\frac{p \log d}{d}\,\max_k \|b_{:k}\|^2
        + 2\,\frac{q \log n}{n}\,\max_j \|b_{j:}\|^2 + (\log d)(\log n)\,\max_{j,k} |b_{jk}|^2.    (5.2.2)

The notations $b_{j:}$ and $b_{:k}$ refer to the $j$th row and $k$th column of the matrix $B$, while $b_{jk}$ is the
$(j, k)$ entry of the matrix. In other words, the random submatrix $Z$ gets its share of the total
norm of the matrix $B$. The fluctuation terms reflect the maximum row norm and the maximum
column norm of $B$, as well as the size of the largest entry. There is also a weak dependence on
the ambient dimensions $d$ and $n$.


The  Analysis 


The argument has much in common with the calculations for a random column submatrix, but
we need to do some extra work to handle the interaction between the random row sampling and
the random column sampling.

To begin, we express $\|Z\|^2$ in terms of the maximum eigenvalue of a random positive-semidefinite
matrix:

    \mathbb{E}(\|Z\|^2) = \mathbb{E}\,\lambda_{\max}((PBQ)(PBQ)^*)
        = \mathbb{E}\,\lambda_{\max}(PBQB^*P)
        = \mathbb{E}\Big[ \mathbb{E}\Big[ \lambda_{\max}\Big( \sum_{k=1}^{n} \xi_k (PB)_{:k} (PB)_{:k}^* \Big) \,\Big|\, P \Big] \Big].

We have used the facts that $P = P^*$ and that $QQ^* = Q$. Invoking the matrix Chernoff inequality
(5.1.2), conditional on the choice of $P$, we obtain

    \mathbb{E}(\|Z\|^2) \le \frac{(e - 1)\,q}{n}\,\mathbb{E}\,\lambda_{\max}(PBB^*P) + \mathbb{E}\max_k \|(PB)_{:k}\|^2 \cdot \log d.    (5.2.3)

The notation $(PB)_{:k}$ refers to the $k$th column of the matrix $PB$. The required calculation is
analogous to the one in Section 5.2.1, so we omit the details. To reach a deterministic bound, we
still have two more expectations to control.

Next, we examine the term in (5.2.3) that involves the maximum eigenvalue:

    \mathbb{E}\,\lambda_{\max}(PBB^*P) = \mathbb{E}\,\lambda_{\max}(B^*PB) = \mathbb{E}\,\lambda_{\max}\Big( \sum_{j=1}^{d} \eta_j\, b_{j:}^* b_{j:} \Big).

The first identity holds because the nonzero eigenvalues of $CC^*$ equal the nonzero eigenvalues
of $C^*C$ for any matrix $C$; here $C = PB$ and $P^*P = P$. Another application of the matrix Chernoff
inequality (5.1.2), now in dimension $n$, yields

    \mathbb{E}\,\lambda_{\max}(PBB^*P) \le \frac{(e - 1)\,p}{d}\,\lambda_{\max}(B^*B) + \max_j \|b_{j:}\|^2 \cdot \log n.    (5.2.4)

Recall that $\lambda_{\max}(B^*B) = \|B\|^2$ to simplify this expression slightly.

Last, we develop a bound on the maximum column norm in (5.2.3). This result also follows
from the matrix Chernoff inequality, but we need to do a little work to see why. We are going to
treat the maximum column norm as the maximum eigenvalue of an independent sum of random
diagonal matrices. Observe that

    \|(PB)_{:k}\|^2 = \sum_{j=1}^{d} \eta_j |b_{jk}|^2 \quad\text{for each } k = 1, \dots, n.

Using this representation, we see that

    \max_k \|(PB)_{:k}\|^2 = \lambda_{\max}\Big( \mathrm{diag}\Big( \sum_{j=1}^{d} \eta_j |b_{j1}|^2, \dots, \sum_{j=1}^{d} \eta_j |b_{jn}|^2 \Big) \Big)
        = \lambda_{\max}\Big( \sum_{j=1}^{d} \eta_j\, \mathrm{diag}(|b_{j:}|^2) \Big).

When applied to a vector, the notation $|\cdot|^2$ refers to the componentwise modulus squared. To
activate the matrix Chernoff bound, we need to compute the two parameters that appear in (5.1.2).
First, the upper bound parameter $R$ satisfies

    R = \max_j \lambda_{\max}(\mathrm{diag}(|b_{j:}|^2)) = \max_j \max_k |b_{jk}|^2.

Second, to compute the upper mean parameter $\mu_{\max}$, note that

    \mathbb{E} \sum_{j=1}^{d} \eta_j\, \mathrm{diag}(|b_{j:}|^2) = \frac{p}{d}\,\mathrm{diag}\Big( \sum_{j=1}^{d} |b_{j:}|^2 \Big)
        = \frac{p}{d}\,\mathrm{diag}(\|b_{:1}\|^2, \dots, \|b_{:n}\|^2),

which yields

    \mu_{\max} = \frac{p}{d}\,\max_k \|b_{:k}\|^2.

Therefore, the matrix Chernoff inequality implies

    \mathbb{E}\max_k \|(PB)_{:k}\|^2 \le \frac{(e - 1)\,p}{d}\,\max_k \|b_{:k}\|^2 + \max_{j,k} |b_{jk}|^2 \cdot \log n.    (5.2.5)


On  average,  the  maximum  column  norm  of  a  random  submatrix  PB  with  about  p  nonzero  rows 
gets  its  share  p/d  of  the  maximum  column  norm  of  B,  plus  a  fluctuation  term  that  depends  on 
the  magnitude  of  the  largest  entry  of  B  and  the  logarithm  of  the  number  n  of  columns. 

Combine  the  three  bounds  (5.2.3),  (5.2.4),  and  (5.2.5)  to  reach  the  result  (5.2.2).  We  have 
simplified  numerical  constants  to  make  the  expression  more  compact. 
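The three ingredients (5.2.3), (5.2.4), and (5.2.5) can be combined numerically before the constants are rounded. The NumPy sketch below does so, keeping the exact factor $e - 1$ in each step, and estimates $\mathbb{E}\|Z\|^2$ by simulation for comparison. The function names and the trial count are assumptions chosen for this illustration.

```python
import numpy as np


def row_column_bound(B, p, q):
    """Combine (5.2.3)-(5.2.5) with the unrounded constant c = e - 1."""
    d, n = B.shape
    c = np.e - 1.0
    norm_sq = np.linalg.norm(B, 2) ** 2                 # squared spectral norm of B
    max_col = np.max(np.sum(np.abs(B) ** 2, axis=0))    # max_k ||b_{:k}||^2
    max_row = np.max(np.sum(np.abs(B) ** 2, axis=1))    # max_j ||b_{j:}||^2
    max_entry = np.max(np.abs(B)) ** 2                  # max_{j,k} |b_{jk}|^2
    return (c * c * (p / d) * (q / n) * norm_sq
            + c * (p / d) * np.log(d) * max_col
            + c * (q / n) * np.log(n) * max_row
            + np.log(d) * np.log(n) * max_entry)


def mean_squared_norm(B, p, q, trials=200, seed=0):
    """Monte Carlo estimate of E ||PBQ||^2 under the Bernoulli sampling model."""
    rng = np.random.default_rng(seed)
    d, n = B.shape
    total = 0.0
    for _ in range(trials):
        eta = rng.random(d) < p / d        # row selectors, mean p/d
        xi = rng.random(n) < q / n         # column selectors, mean q/n
        Z = (B * xi) * eta[:, None]        # Z = P B Q with P, Q diagonal 0-1 matrices
        total += np.linalg.norm(Z, 2) ** 2
    return total / trials
```

Since $(e - 1)^2 < 3$ and $e - 1 < 2$, the value returned by `row_column_bound` is never larger than the rounded combination of the same four terms.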


5.3 Application: When is an Erdős–Rényi Graph Connected?

Random graph theory concerns probabilistic models for the interactions between pairs of
objects. One basic question about a random graph is to ask whether there is a path connecting
every pair of vertices or whether some vertices are segregated in different parts of the graph. It
is possible to address this problem by studying the eigenvalues of random matrices, a challenge
that we take up in this section.

5.3.1  Background  on  Graph  Theory 

Recall that an undirected graph is a pair $G = (V, E)$ where $V$ is a set of vertices and $E$ is a set of
edges connecting pairs of distinct vertices. For simplicity, we assume that the vertex set $V =
\{1, \dots, n\}$. The degree $\deg(k)$ of the vertex $k$ is the number of edges in $E$ that include the vertex $k$.

There are some natural matrices associated with an undirected graph. The adjacency matrix
of the graph $G$ is an $n \times n$ symmetric matrix $A$ whose entries indicate which edges are present:

    a_{jk} = 1 \text{ if } \{j, k\} \in E, \quad\text{and}\quad a_{jk} = 0 \text{ if } \{j, k\} \notin E.

We have assumed that edges connect distinct vertices, so the diagonal entries of the matrix $A$
equal zero. Next, define a diagonal matrix $D = \mathrm{diag}(\deg(1), \dots, \deg(n))$ whose entries list the
degrees of the vertices. The Laplacian and normalized Laplacian of the graph are the matrices

    L = D - A \quad\text{and}\quad M = D^{-1/2} L D^{-1/2}.

We adopt the convention that $D^{-1/2}(k, k) = 0$ when $\deg(k) = 0$. The Laplacian matrix $L$ is always
positive semidefinite. The vector $\mathbf{1}$ of ones is always an eigenvector of $L$ with eigenvalue zero.

These matrices and their spectral properties play a central role in modern graph theory. For
example, the graph $G$ is connected if and only if the second-smallest eigenvalue of $L$ is strictly
positive. The second-smallest eigenvalue of $M$ controls the rate at which a random walk on
the graph $G$ converges to the stationary distribution (under appropriate assumptions). See the
book [GR01] or the website [But] for more information about these connections.
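The spectral test for connectivity is short enough to write out. The NumPy sketch below forms $L = D - A$ from an adjacency matrix and checks whether the second-smallest eigenvalue is positive. The function names and the numerical tolerance `tol` are assumptions for this illustration; in floating-point arithmetic a zero eigenvalue shows up as a tiny number of either sign, so some tolerance is needed.

```python
import numpy as np


def laplacian(A):
    """Graph Laplacian L = D - A for a symmetric 0-1 adjacency matrix A."""
    A = np.asarray(A, dtype=float)
    return np.diag(A.sum(axis=1)) - A


def is_connected(A, tol=1e-9):
    """True when the second-smallest eigenvalue of L exceeds the tolerance."""
    eigenvalues = np.linalg.eigvalsh(laplacian(A))   # returned in ascending order
    return bool(eigenvalues[1] > tol)
```

For example, a path on three vertices passes the test, while a graph made of two disjoint edges fails it because its Laplacian has a two-dimensional null space.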

5.3.2 The Model of Erdős and Rényi

The simplest possible example of a random graph is the independent model $G(n, p)$ of Erdős and
Rényi [ER60]. The number $n$ is the number of vertices in the graph, and $p \in (0, 1)$ is the probability
that two vertices are connected. More precisely, here is how to construct a random graph in
$G(n, p)$. Between each pair of distinct vertices, we place an edge independently at random with
probability $p$. In other words, the adjacency matrix takes the form

    a_{jk} = \delta_{jk} \text{ for } 1 \le j < k \le n, \quad a_{jk} = 0 \text{ for } j = k, \quad a_{jk} = \delta_{kj} \text{ for } 1 \le k < j \le n.    (5.3.1)

The family $\{\delta_{jk} : 1 \le j < k \le n\}$ consists of mutually independent Bernoulli($p$) random
variables. Figure 5.1 shows one realization of an Erdős–Rényi graph.

[Figure 5.1 image omitted: sparsity pattern of the adjacency matrix of an Erdős–Rényi graph in G(100, 0.1); nz = 972.]

Figure 5.1: The adjacency matrix of an Erdős–Rényi graph. This figure shows the pattern of
nonzero entries in the adjacency matrix $A$ of a random graph drawn from $G(100, 0.1)$. Out of
a possible 4,950 edges, there are 486 edges present. A basic question is whether the graph is
connected. The graph is disconnected if and only if there is a permutation of the coordinates
so that the adjacency matrix is block diagonal. This property is reflected in the second-smallest
eigenvalue of the Laplacian $L$.


Let us explain how to represent the adjacency matrix and Laplacian matrix of an Erdős–Rényi
graph as a sum of independent random matrices. The adjacency matrix $A$ of a graph in $G(n, p)$
can be written as

    A = \sum_{1 \le j < k \le n} \delta_{jk} (E_{jk} + E_{kj}).    (5.3.2)

This expression is a straightforward translation of the definition (5.3.1) into matrix form.
Similarly, the Laplacian matrix $L$ can be expressed as

    L = \sum_{1 \le j < k \le n} \delta_{jk} (E_{jj} + E_{kk} - E_{jk} - E_{kj}).    (5.3.3)

To verify the formula (5.3.3), observe that the presence of an edge between the vertices $j$ and $k$
increases the degree of $j$ and $k$ by one. Therefore, when $\delta_{jk} = 1$, we augment the $(j, j)$ and $(k, k)$
entries of $L$ to reflect the change in degree, and we mark the $(j, k)$ and $(k, j)$ entries with $-1$ to
reflect the presence of the edge between $j$ and $k$.
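The representations (5.3.2) and (5.3.3) translate directly into code. The NumPy sketch below samples the edge indicators, accumulates one term per edge, and lets us confirm that the result agrees with $L = D - A$. The function names are assumptions for this illustration, and vertices are indexed from 0 rather than 1, as is usual in Python.

```python
import numpy as np


def erdos_renyi_edges(n, p, seed=0):
    """Sample the pairs (j, k) with j < k whose Bernoulli(p) indicator equals one."""
    rng = np.random.default_rng(seed)
    return [(j, k) for j in range(n) for k in range(j + 1, n) if rng.random() < p]


def adjacency_and_laplacian(n, edges):
    """Accumulate the sums (5.3.2) and (5.3.3), one term per edge that is present."""
    A = np.zeros((n, n))
    L = np.zeros((n, n))
    for j, k in edges:
        A[j, k] += 1.0          # E_jk + E_kj
        A[k, j] += 1.0
        L[j, j] += 1.0          # E_jj + E_kk: each endpoint gains one degree
        L[k, k] += 1.0
        L[j, k] -= 1.0          # -E_jk - E_kj: mark the edge
        L[k, j] -= 1.0
    return A, L
```

Each term added to `L` is positive semidefinite, which is exactly the structure that the matrix Chernoff inequality requires.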

5.3.3 Connectivity of an Erdős–Rényi Graph

We will obtain a near-optimal bound for the range of parameters where an Erdős–Rényi graph
$G(n, p)$ is likely to be connected. We can accomplish this goal by showing that the second-smallest
eigenvalue of the $n \times n$ random Laplacian matrix $L = D - A$ is strictly positive. We will solve
the problem by using the matrix Chernoff inequality to study the second-smallest eigenvalue of
the random Laplacian $L$.

We need to form a random matrix $Y$ that consists of independent positive-semidefinite terms
and whose minimum eigenvalue coincides with the second-smallest eigenvalue of $L$. To that
end, define an $(n - 1) \times n$ partial isometry $R$ that restricts a vector to the orthogonal
complement of the vector $\mathbf{1}$ of ones. That is, the rows of $R$ form an orthonormal family and the null
space of $R$ is spanned by the vector $\mathbf{1}$. Now, consider the random matrix

    Y = RLR^* = \sum_{1 \le j < k \le n} \delta_{jk} \cdot R(E_{jj} + E_{kk} - E_{jk} - E_{kj})R^*.    (5.3.4)

Recall that $\{\delta_{jk}\}$ is an independent family of Bernoulli($p$) random variables, so the summands
are mutually independent. The Conjugation Rule (2.1.4) ensures that each summand remains
positive-semidefinite. Since $\mathbf{1}$ is an eigenvector with eigenvalue zero associated with the
positive-semidefinite matrix $L$, the minimum eigenvalue of $Y$ coincides with the second-smallest
eigenvalue of $L$.

To apply the matrix Chernoff inequality, we need a uniform upper bound $B$ on the eigenvalues
of the summands. We have

    \|\delta_{jk} \cdot R(E_{jj} + E_{kk} - E_{jk} - E_{kj})R^*\|
        \le |\delta_{jk}| \cdot \|R\| \cdot \|E_{jj} + E_{kk} - E_{jk} - E_{kj}\| \cdot \|R^*\| \le 2,

so we may take $B = 2$. The first bound follows from the submultiplicativity of the spectral norm.
To obtain the second bound, note that $\delta_{jk}$ takes 0-1 values. The matrix $R$ is a partial isometry, so
its norm equals one. Finally, a direct calculation shows that $T = E_{jj} + E_{kk} - E_{jk} - E_{kj}$ satisfies
the polynomial identity $T^2 = 2T$, so each eigenvalue of $T$ must equal zero or two.

Next, we compute the expectation of the matrix $Y$.

    \mathbb{E} Y = p \cdot R \Big[ \sum_{1 \le j < k \le n} (E_{jj} + E_{kk} - E_{jk} - E_{kj}) \Big] R^*
        = p \cdot R \big[ (n - 1)\,I_n - (\mathbf{1}\mathbf{1}^* - I_n) \big] R^* = pn \cdot I_{n-1}.

The first identity follows when we apply linearity of expectation to (5.3.4) and then use linearity
of matrix multiplication to draw the sum inside the conjugation by $R$. The term $(n - 1)\,I_n$ emerges
when we sum the diagonal matrices. The term $\mathbf{1}\mathbf{1}^* - I_n$ comes from the off-diagonal matrix units,
once we note that the matrix $\mathbf{1}\mathbf{1}^*$ has one in each component. The last identity holds because $R$
annihilates the vector $\mathbf{1}$, while $RR^* = I_{n-1}$. We conclude that

    \mu_{\min}(Y) = \lambda_{\min}(\mathbb{E} Y) = pn.

This is all the information we need.
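Both calculations above can be checked numerically. The NumPy sketch below builds one possible restriction matrix $R$ from a full QR factorization of the vector of ones (any matrix with orthonormal rows and null space spanned by $\mathbf{1}$ would serve), forms the edge terms $T$, and assembles $\mathbb{E} Y$ term by term. The function names and the QR-based construction are assumptions for this illustration.

```python
import numpy as np


def restriction_matrix(n):
    """An (n-1) x n matrix with orthonormal rows whose null space is spanned by ones."""
    # In the complete QR factorization of the ones vector, the first column of Q is
    # parallel to ones and the remaining columns span its orthogonal complement.
    Q, _ = np.linalg.qr(np.ones((n, 1)), mode="complete")
    return Q[:, 1:].T


def edge_term(n, j, k):
    """T = E_jj + E_kk - E_jk - E_kj for 0-based indices j != k."""
    T = np.zeros((n, n))
    T[j, j] = T[k, k] = 1.0
    T[j, k] = T[k, j] = -1.0
    return T


def expected_y(n, p):
    """E Y = R (sum over j < k of p * T_jk) R^*, assembled term by term."""
    R = restriction_matrix(n)
    EL = sum(p * edge_term(n, j, k) for j in range(n) for k in range(j + 1, n))
    return R @ EL @ R.T
```

With these helpers one can confirm that `T @ T` equals `2 * T`, that the conjugated term `R @ T @ R.T` has spectral norm at most two, and that `expected_y(n, p)` is $pn$ times the identity of size $n - 1$.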

Invoke the tail bound (5.1.5) with $\delta = 1 - \varepsilon$, ambient dimension $n - 1$, and upper bound $B = 2$
to obtain, for $\varepsilon \in (0, 1)$,

    \mathbb{P}\{\lambda_2^{\uparrow}(L) \le \varepsilon \cdot pn\} = \mathbb{P}\{\lambda_{\min}(Y) \le \varepsilon \cdot pn\}
        \le (n - 1) \left[ \frac{e^{\varepsilon - 1}}{\varepsilon^{\varepsilon}} \right]^{pn/2},

where $\lambda_2^{\uparrow}(L)$ denotes the second-smallest eigenvalue of $L$.

To appreciate what this means, we may think about the situation where $\varepsilon \to 0$. Then the bracket
tends to $e^{-1}$, and we see that the second-smallest eigenvalue of $L$ is unlikely to be zero when
$\log(n - 1) - pn/2 < 0$. Rearranging this expression, we obtain a sufficient condition

    p > \frac{2 \log(n - 1)}{n}

for an Erdős–Rényi graph $G(n, p)$ to be connected with high probability as $n \to \infty$. This bound is
quite close to the optimal result, which lacks the factor two on the right-hand side. It is possible
to make this reasoning more precise, but it does not seem worth the fuss.
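A small experiment shows the sufficient condition in action. The NumPy sketch below computes the threshold $2 \log(n-1)/n$, evaluates the limiting bound $(n-1)\,e^{-pn/2}$, and estimates the fraction of sampled graphs that are connected according to the spectral test. The function names, the tolerance, and the trial count are assumptions for this illustration.

```python
import math

import numpy as np


def connectivity_threshold(n):
    """The sufficient edge probability 2 log(n - 1) / n from the Chernoff argument."""
    return 2.0 * math.log(n - 1) / n


def limiting_failure_bound(n, p):
    """(n - 1) * exp(-p n / 2), the limit of the tail bound as epsilon tends to zero."""
    return (n - 1) * math.exp(-p * n / 2.0)


def connected_fraction(n, p, trials=50, seed=0, tol=1e-8):
    """Fraction of sampled G(n, p) graphs whose Laplacian has a positive second eigenvalue."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        upper = np.triu(rng.random((n, n)) < p, k=1).astype(float)  # indicators for j < k
        A = upper + upper.T
        L = np.diag(A.sum(axis=1)) - A
        hits += int(np.linalg.eigvalsh(L)[1] > tol)
    return hits / trials
```

At the threshold itself the limiting bound equals one and is therefore uninformative; the bound starts to bite only for $p$ strictly above $2 \log(n-1)/n$.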


5.4  Proof  of  the  Matrix  Chernoff  Inequalities 


The first step toward the matrix Chernoff inequalities is to develop an appropriate semidefinite
bound for the mgf and cgf of a random positive-semidefinite matrix. The method for establishing
this bound mimics the proof in the scalar case: we simply bound the exponential with a linear
function.


Lemma 5.4.1 (Matrix Chernoff: Mgf and Cgf Bound). Suppose that $X$ is a random
positive-semidefinite matrix that satisfies $\lambda_{\max}(X) \le R$. Then

    \mathbb{E}\,e^{\theta X} \preceq \exp\Big( \frac{e^{R\theta} - 1}{R} \cdot (\mathbb{E} X) \Big)
    \quad\text{and}\quad
    \log \mathbb{E}\,e^{\theta X} \preceq \frac{e^{R\theta} - 1}{R} \cdot (\mathbb{E} X)
    \quad\text{for } \theta \in \mathbb{R}.


Proof. Consider the function $f(x) = e^{\theta x}$. Since $f$ is convex, its graph lies below the chord
connecting two points. In particular,

    f(x) \le f(0) + \frac{f(R) - f(0)}{R} \cdot x \quad\text{for } x \in [0, R].

In detail,

    e^{\theta x} \le 1 + \frac{e^{R\theta} - 1}{R} \cdot x \quad\text{for } x \in [0, R].

By assumption, each eigenvalue of $X$ lies in the interval $[0, R]$. Thus, the Transfer Rule (2.1.6)
implies that

    e^{\theta X} \preceq I + \frac{e^{R\theta} - 1}{R} \cdot X.

Expectation respects the semidefinite order, so

    \mathbb{E}\,e^{\theta X} \preceq I + \frac{e^{R\theta} - 1}{R} \cdot (\mathbb{E} X)
        \preceq \exp\Big( \frac{e^{R\theta} - 1}{R} \cdot (\mathbb{E} X) \Big).

The second relation is a consequence of the fact that $I + A \preceq e^{A}$ for every Hermitian matrix $A$,
which we obtain by applying the Transfer Rule (2.1.6) to the inequality $1 + a \le e^{a}$, valid for all
$a \in \mathbb{R}$.

To obtain the semidefinite bound for the cgf, we simply take the logarithm of the semidefinite
bound for the mgf. This operation preserves the semidefinite order because of the property
(2.1.9) that the logarithm is operator monotone. □
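The chord inequality at the heart of this proof, and its matrix counterpart obtained from the Transfer Rule, can be checked numerically. The NumPy sketch below measures the smallest gap between the chord and the exponential on a grid, and the smallest eigenvalue of $I + R^{-1}(e^{R\theta} - 1)X - e^{\theta X}$ for a given positive-semidefinite $X$. The function names and the grid size are assumptions for this illustration.

```python
import numpy as np


def chord_gap(theta, R, num=1001):
    """Minimum over a grid in [0, R] of chord(x) - exp(theta * x); nonnegative by convexity."""
    x = np.linspace(0.0, R, num)
    chord = 1.0 + (np.exp(theta * R) - 1.0) / R * x
    return np.min(chord - np.exp(theta * x))


def matrix_chord_gap(X, theta, R):
    """Smallest eigenvalue of I + ((e^{R theta} - 1) / R) X - exp(theta X).

    X is assumed Hermitian with eigenvalues in [0, R]. The matrix exponential is
    formed from the eigendecomposition of X.
    """
    w, V = np.linalg.eigh(X)
    exp_theta_X = (V * np.exp(theta * w)) @ V.conj().T
    bound = np.eye(X.shape[0]) + (np.exp(theta * R) - 1.0) / R * X
    return np.linalg.eigvalsh(bound - exp_theta_X)[0]
```

Both gaps should be nonnegative up to rounding error for every real $\theta$, including negative values, which is what allows the lemma to serve the minimum eigenvalue bounds as well.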


We break the proof of the matrix Chernoff inequality into two pieces. First, we establish the bounds
on the maximum eigenvalue, which are slightly easier. Afterward, we develop the bounds on the
minimum eigenvalue.

Proof of Theorem 5.1.1, Maximum Eigenvalue Bounds. Consider a finite sequence $\{X_k\}$ of
independent, random Hermitian matrices that satisfy

    X_k \succeq 0 \quad\text{and}\quad \lambda_{\max}(X_k) \le R \quad\text{for each index } k.

The cgf bound, Lemma 5.4.1, states that

    \log \mathbb{E}\,e^{\theta X_k} \preceq g(\theta) \cdot (\mathbb{E} X_k) \quad\text{where}\quad g(\theta) = \frac{e^{R\theta} - 1}{R} \quad\text{for } \theta > 0.    (5.4.1)

We begin with the upper bound (5.1.2) for $\mathbb{E}\,\lambda_{\max}(Y)$. Using the fact (2.1.7) that the trace of the
exponential function is monotone with respect to the semidefinite order, we substitute these cgf
bounds into the master inequality (3.6.1) for the maximum eigenvalue to reach

    \mathbb{E}\,\lambda_{\max}(Y) \le \inf_{\theta > 0} \frac{1}{\theta} \log \mathrm{tr} \exp\Big( g(\theta) \sum_k \mathbb{E} X_k \Big)
        \le \inf_{\theta > 0} \frac{1}{\theta} \log \big[ d\,\lambda_{\max}\big( \exp( g(\theta) \cdot (\mathbb{E} Y) ) \big) \big]
        = \inf_{\theta > 0} \frac{1}{\theta} \log \big[ d \exp\big( \lambda_{\max}( g(\theta) \cdot (\mathbb{E} Y) ) \big) \big]
        = \inf_{\theta > 0} \frac{1}{\theta} \log \big[ d \exp\big( g(\theta) \cdot \lambda_{\max}(\mathbb{E} Y) \big) \big]
        = \inf_{\theta > 0} \frac{1}{\theta} \big[ \log d + g(\theta) \cdot \mu_{\max} \big].

In the second line, we use the fact that the matrix exponential is positive definite to bound the
trace by $d$ times the maximum eigenvalue. We have also identified the sum as $\mathbb{E} Y$. The third line
follows from the Spectral Mapping Theorem, Proposition 2.1.3. Next, we use the fact (2.1.2) that
the maximum eigenvalue map is positive homogeneous, which depends on the observation that
$g(\theta) > 0$ for $\theta > 0$. Finally, we identify the quantity $\mu_{\max}$, defined in (5.1.1). The infimum does not
admit a closed form, but we can obtain the expression (5.1.2) by making the change of variables
$\theta \mapsto \theta/R$.

Next, we turn to the upper bound (5.1.4) for the upper tail of the maximum eigenvalue.
Substitute the cgf bounds (5.4.1) into the master inequality (3.6.3) to reach

    \mathbb{P}\{\lambda_{\max}(Y) \ge t\} \le \inf_{\theta > 0} e^{-\theta t}\,\mathrm{tr} \exp\Big( g(\theta) \sum_k \mathbb{E} X_k \Big)
        \le \inf_{\theta > 0} e^{-\theta t} \cdot d \exp\big( g(\theta) \cdot \mu_{\max} \big).

The steps here are identical with the previous argument. Make the change of variables
$t \mapsto (1 + \delta)\,\mu_{\max}$. The infimum is achieved at $\theta = R^{-1} \log(1 + \delta)$, which leads to the tail
bound (5.1.4). □
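The final optimization is a one-variable calculus exercise: the exponent $-\theta t + g(\theta)\,\mu_{\max}$ with $t = (1 + \delta)\,\mu_{\max}$ has derivative $\mu_{\max}\,(e^{R\theta} - 1 - \delta)$, which vanishes at $\theta = R^{-1} \log(1 + \delta)$. The short sketch below checks this against a grid search; the function names and test values are assumptions for this illustration.

```python
import math


def log_tail_objective(theta, delta, mu_max, R):
    """Exponent -theta * t + g(theta) * mu_max, with t = (1 + delta) * mu_max."""
    t = (1.0 + delta) * mu_max
    g = (math.exp(R * theta) - 1.0) / R
    return -theta * t + g * mu_max


def optimal_theta(delta, R):
    """Stationary point of the exponent: theta = log(1 + delta) / R."""
    return math.log1p(delta) / R
```

Substituting the optimal value gives the exponent $(\mu_{\max}/R)\,[\delta - (1 + \delta)\log(1 + \delta)]$, which is the logarithm of the bracketed factor in (5.1.4) raised to the power $\mu_{\max}/R$.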

The  lower  bounds  follow  from  a  related  argument  that  is  slightly  more  delicate. 


Proof of Theorem 5.1.1, Minimum Eigenvalue Bounds. Once again, consider a finite sequence $\{X_k\}$
of independent, random Hermitian matrices that satisfy

    X_k \succeq 0 \quad\text{and}\quad \lambda_{\max}(X_k) \le R \quad\text{for each index } k.

The cgf bound, Lemma 5.4.1, states that

    \log \mathbb{E}\,e^{\theta X_k} \preceq g(\theta) \cdot (\mathbb{E} X_k) \quad\text{where}\quad g(\theta) = \frac{e^{R\theta} - 1}{R} \quad\text{for } \theta < 0.    (5.4.2)

Note that $g(\theta) < 0$ for $\theta < 0$, which alters a number of the steps in the argument.

We commence with the lower bound (5.1.3) for $\mathbb{E}\,\lambda_{\min}(Y)$. As stated in (2.1.7), the trace
exponential function is monotone with respect to the semidefinite order, so the master inequality
(3.6.2) for the minimum eigenvalue delivers

    \mathbb{E}\,\lambda_{\min}(Y) \ge \sup_{\theta < 0} \frac{1}{\theta} \log \mathrm{tr} \exp\Big( g(\theta) \sum_k \mathbb{E} X_k \Big)
        \ge \sup_{\theta < 0} \frac{1}{\theta} \log \big[ d\,\lambda_{\max}\big( \exp( g(\theta) \cdot (\mathbb{E} Y) ) \big) \big]
        = \sup_{\theta < 0} \frac{1}{\theta} \log \big[ d \exp\big( \lambda_{\max}( g(\theta) \cdot (\mathbb{E} Y) ) \big) \big]
        = \sup_{\theta < 0} \frac{1}{\theta} \log \big[ d \exp\big( g(\theta) \cdot \lambda_{\min}(\mathbb{E} Y) \big) \big]
        = \sup_{\theta < 0} \frac{1}{\theta} \big[ \log d + g(\theta) \cdot \mu_{\min} \big].

Most of the steps are the same as in the proof of the upper bound (5.1.2), so we focus on the
differences. Since the factor $\theta^{-1}$ in the first and second lines is negative, upper bounds on the
trace reduce the value of the expression. We move to the fourth line by invoking the property
$\lambda_{\max}(\alpha A) = \alpha\,\lambda_{\min}(A)$ for $\alpha < 0$, which follows from (2.1.2) and (2.1.3). This piece of algebra
depends on the fact that $g(\theta) < 0$ when $\theta < 0$. To obtain the result (5.1.3), we change variables:
$\theta \mapsto -\theta/R$.

Finally, we establish the bound (5.1.5) for the lower tail of the minimum eigenvalue.
Introduce the cgf bounds (5.4.2) into the master inequality (3.6.4) to reach

    \mathbb{P}\{\lambda_{\min}(Y) \le t\} \le \inf_{\theta < 0} e^{-\theta t}\,\mathrm{tr} \exp\Big( g(\theta) \sum_k \mathbb{E} X_k \Big)
        \le \inf_{\theta < 0} e^{-\theta t} \cdot d \exp\big( g(\theta) \cdot \mu_{\min} \big).

The justifications here match those in the previous argument. Make the change of variables
$t \mapsto (1 - \delta)\,\mu_{\min}$. The infimum is attained at $\theta = R^{-1} \log(1 - \delta)$, which yields the tail
bound (5.1.5). □


5.5  Notes 

As  usual,  we  continue  with  an  overview  of  background  references  and  related  work. 

5.5.1 Matrix Chernoff Inequalities

Scalar Chernoff inequalities date to the paper [Che52, Thm. 1] by Herman Chernoff. The original
result provides probability bounds for the number of successes in a sequence of independent but
non-identical Bernoulli trials. Chernoff's proof combines the scalar Laplace transform method
with refined bounds on the mgf of a Bernoulli random variable. It is very common to encounter
simplified versions of Chernoff's result, such as [Lug09, Exer. 8] or [MR95, §4.1].

In their paper [AW02], Ahlswede and Winter developed a matrix version of the Chernoff
inequality. The matrix mgf bound, Lemma 5.4.1, essentially appears in their work. Ahlswede and
Winter focus on the case of iid random matrices, in which case their results are comparable with
Theorem 5.1.1. For the general case, their approach leads to mean parameters of the form

    \mu_{\max} = \sum_k \lambda_{\max}(\mathbb{E} X_k) \quad\text{and}\quad \mu_{\min} = \sum_k \lambda_{\min}(\mathbb{E} X_k).

It is clear that these mean parameters may be substantially inferior to the mean parameters $\mu_{\max}$
and $\mu_{\min}$ that we defined in Theorem 5.1.1.

The tail bounds from Theorem 5.1.1 are drawn from [Tro11d, §5], but the expectation bounds
we present are new. The paper [GT11] extends the matrix Chernoff inequality to provide upper
and lower tail bounds for all eigenvalues of a sum of positive-semidefinite random matrices.
Finally, Chapter 7 contains a slight improvement of the upper bounds from Theorem 5.1.1.

5.5.2  Random  Submatrices 

The problem of studying a random submatrix drawn from a fixed matrix has a long history. An
early example is the paving problem from operator theory, which asks for a well-conditioned set
of columns (or a well-conditioned submatrix) inside a fixed matrix. Random selection provides
a natural approach to this question. The papers [BT87, BT91, KT94] study random paving using
sophisticated tools from functional analysis. See the paper [NT12] for a summary of research on
the paving problem.

Later, Rudelson and Vershynin [RV07] showed that the noncommutative Khintchine inequality
provides a clean way to bound the norm of a random column submatrix (or a random row and
column submatrix) drawn from a fixed matrix. Their ideas have found many applications in the
mathematical signal processing literature. See, for example, the paper [Tro08a]. The same
approach led to the work [Tro08c], which contains a new proof of [BT91, Thm. 2.1].

The article [Tro11e] contains the observation that the matrix Chernoff inequality is an ideal
tool for studying random submatrices. It applies this technique to study a random matrix that
arises in numerical linear algebra, and it achieves the optimal estimate for this problem. Our
analysis of a random column submatrix is based on this work. The analysis of a random row
and column submatrix is new. The paper [CD12], by Chrétien and Darses, uses matrix Chernoff
bounds in a more sophisticated way to develop tail bounds for the norm of a random row and
column submatrix.

5.5.3  Random  Graphs 

The analysis of random graphs and random hypergraphs appeared as one of the earliest
applications of matrix concentration inequalities [AW02]. Christofides and Markström developed a
matrix Hoeffding inequality to aid in this purpose [CM08]. Later, Oliveira wrote two papers [Oli10a,
Oli11] on random graph theory based on matrix concentration. We recommend these works for
further information.

The device we have used to analyze the second-smallest eigenvalue of a random graph Laplacian
can be extended to obtain tail bounds for all the eigenvalues of a sum of independent random
matrices. See the paper [GT11] for a development of this idea.


Chapter 6  A Sum of Bounded Random Matrices


In this chapter, we describe matrix concentration inequalities that generalize the classical
Bernstein bound. The matrix Bernstein inequalities concern a random matrix formed as a sum of
independent, bounded random matrices. The results allow us to study how much a random
matrix deviates from its mean value in the spectral norm.

To be rigorous, let us suppose that $X_1, \dots, X_n$ are independent random matrices with the
properties

    \mathbb{E} X_k = 0 \quad\text{and}\quad \lambda_{\max}(X_k) \le R \quad\text{for each } k.

Form the sum $Y = \sum_k X_k$. The matrix Bernstein inequality allows us to study the expectation and
tail behavior of $\lambda_{\max}(Y)$ in terms of the variance $\mathbb{E}(Y^2)$.

Matrix Bernstein inequalities have a much wider scope of application than the last paragraph
might suggest. First, if the summands are not centered, we can subtract off the mean and use the
matrix Bernstein method to obtain information about $\lambda_{\max}(Y - \mathbb{E} Y)$. Second, we can obtain
bounds for the minimum eigenvalue $\lambda_{\min}(Y)$ by applying the matrix Bernstein bounds to $-Y$.
Third, we can extend the result to study the spectral norm of a sum of independent, general
random matrices that satisfy a uniform norm bound.

In  these  pages,  we  can  only  give  a  coarse  indication  of  how  researchers  have  used  the  matrix 
Bernstein  inequality.  We  have  selected  two  typical  examples  from  the  literature  on  randomized 
matrix  approximation.  First,  we  explain  how  to  develop  a  randomized  algorithm  for  approximate 
matrix  multiplication,  and  we  establish  an  error  bound  for  this  method.  Second,  we  consider  the 
technique  of  randomized  sparsification,  in  which  we  replace  a  dense  matrix  with  a  sparse  proxy 
that  has  similar  spectral  behavior.  There  are  many  other  examples,  some  of  which  appear  in  the 
annotated  bibliography. 

Altogether, the matrix Bernstein inequality is a powerful tool with a huge number of
applications. It is particularly effective for studying randomized approximations of a given matrix.
Nevertheless, let us emphasize that, when the matrix Chernoff inequality, Theorem 5.1.1,
happens to apply, it often delivers better results for a given problem.

Section 6.1 describes the Bernstein inequality for Hermitian matrices, and §6.2 presents the
adaptation to general matrices. Afterward, in §§6.3-6.4, we continue with the two randomized
approximation examples. We conclude with the proof of the matrix Bernstein inequalities
in §6.5.

6.1  A  Sum  of  Bounded  Hermitian  Matrices 

In the scalar setting, there are a large number of concentration bounds that fall under the
heading "Bernstein inequality." Most of these bounds have extensions to matrices. For simplicity, we
focus on the most famous of them all, a tail bound for the sum $Y$ of independent, zero-mean
random variables that are subject to a uniform upper bound. In this case, the Bernstein inequality
shows that the sum $Y$ concentrates around its mean value. For moderate deviations, the sum
behaves like a normal random variable with the same variance as $Y$. For large deviations, the
sum has tails that decay at least as fast as an exponential random variable.

In analogy, the matrix Bernstein inequality concerns a sum of independent, zero-mean
Hermitian matrices whose eigenvalues are bounded above. The theorem demonstrates that the
maximum eigenvalue of the sum acts much like the scalar random variable $Y$ that we discussed
in the last paragraph.

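Theorem 6.1.1 (Matrix Bernstein: Hermitian Case). Consider a finite sequence $\{X_k\}$ of
independent, random, Hermitian matrices with dimension $d$. Assume that

$$\mathbb{E} X_k = 0 \quad\text{and}\quad \lambda_{\max}(X_k) \le R.$$

Introduce the random matrix

$$Y = \sum_k X_k.$$

Compute the variance parameter

$$\sigma^2 = \sigma^2(Y) = \| \mathbb{E}(Y^2) \|. \tag{6.1.1}$$

Then

$$\mathbb{E} \lambda_{\max}(Y) \le \sqrt{2 \sigma^2 \log d} + \frac{1}{3} R \log d. \tag{6.1.2}$$

Furthermore, for all $t \ge 0$,

$$\mathbb{P}\{ \lambda_{\max}(Y) \ge t \} \le d \cdot \exp\left( \frac{-t^2/2}{\sigma^2 + Rt/3} \right). \tag{6.1.3}$$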


The  proof  of  Theorem  6.1.1  appears  below  in  §6.5. 
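To get a feel for the size of these bounds, one can simply evaluate the right-hand sides. The sketch below assumes the expectation bound has the form $\sqrt{2\sigma^2 \log d} + \frac{1}{3} R \log d$ and the tail bound has the form $d \cdot \exp\big(-(t^2/2)/(\sigma^2 + Rt/3)\big)$; the function names and the sample parameter values are illustrative choices, not part of the notes.

```python
import math

def bernstein_expectation_bound(sigma2: float, R: float, d: int) -> float:
    """Right-hand side of the expectation bound: sqrt(2 sigma^2 log d) + (R/3) log d."""
    log_d = math.log(d)
    return math.sqrt(2.0 * sigma2 * log_d) + R * log_d / 3.0

def bernstein_tail_bound(t: float, sigma2: float, R: float, d: int) -> float:
    """Right-hand side of the tail bound, capped at 1 because it bounds a probability."""
    if t < 0:
        raise ValueError("t must be nonnegative")
    exponent = -(t * t / 2.0) / (sigma2 + R * t / 3.0)
    return min(1.0, d * math.exp(exponent))

# Illustrative parameters (assumed values): variance 4, uniform bound 1, dimension 100.
print(bernstein_expectation_bound(sigma2=4.0, R=1.0, d=100))
for t in (5.0, 10.0, 20.0):
    print(t, bernstein_tail_bound(t, sigma2=4.0, R=1.0, d=100))
```

The cap at 1 makes visible the range of $t$ where the dimensional factor $d$ renders the tail bound uninformative.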


6.1.1  Discussion 

Let us spend a few moments to discuss this result and its implications. First, observe that we can
express the variance parameter (6.1.1) in terms of the summands:

$$\sigma^2(Y) = \Big\| \sum_{j,k} \mathbb{E}(X_j X_k) \Big\| = \Big\| \sum_k \mathbb{E}(X_k^2) \Big\|.$$

The second relation holds because the summands are independent, and each one has zero mean.
This identity parallels the scalar result that the variance of a sum of independent random
variables is the sum of the variances.

The expectation bound (6.1.2) shows that the expectation of $\lambda_{\max}(Y)$ is on the same scale as
the standard deviation $\sigma$ and the upper bound $R$ on the summands; there is also a weak
dependence on the ambient dimension $d$. In general, all three of these features are necessary.

Next, let us interpret the tail bound (6.1.3). The only difference between this result and the
scalar Bernstein bound is the addition of the dimensional factor $d$, which reduces the range of $t$
where the inequality is informative. To get a better idea of what this result means, it is helpful to
make a further estimate:

$$\mathbb{P}\{ \lambda_{\max}(Y) \ge t \} \le \begin{cases} d \cdot \exp(-3t^2/8\sigma^2), & t \le \sigma^2/R \\ d \cdot \exp(-3t/8R), & t \ge \sigma^2/R. \end{cases} \tag{6.1.4}$$


In other words, for moderate values of $t$, the tail probability decays at least as fast as the tail
of a Gaussian random variable with variance $4\sigma^2/3$. For larger values of $t$, the tail
probability decays at least as fast as that of an exponential random variable with mean $8R/3$.
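The split bound can be recovered from the tail bound (6.1.3) by estimating the denominator $\sigma^2 + Rt/3$ separately in each regime. The following short derivation is a sketch of that elementary step, using only the quantities $\sigma^2$, $R$, and $t$ from the theorem:

```latex
% Moderate deviations: t \le \sigma^2 / R implies Rt/3 \le \sigma^2/3.
\sigma^2 + \frac{Rt}{3} \le \frac{4\sigma^2}{3}
\quad\Longrightarrow\quad
\frac{-t^2/2}{\sigma^2 + Rt/3} \le -\frac{3t^2}{8\sigma^2}.

% Large deviations: t \ge \sigma^2 / R implies \sigma^2 \le Rt.
\sigma^2 + \frac{Rt}{3} \le \frac{4Rt}{3}
\quad\Longrightarrow\quad
\frac{-t^2/2}{\sigma^2 + Rt/3} \le -\frac{3t}{8R}.
```

Inserting each estimate into the exponential in (6.1.3) yields the corresponding branch of the split bound.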

Next, we point out that Theorem 6.1.1 also yields information about the minimum eigenvalue
of an independent sum of $d$-dimensional Hermitian matrices. In this case, we must assume that

$$\mathbb{E} X_k = 0 \quad\text{and}\quad \lambda_{\min}(X_k) \ge -\underline{R}.$$

Form the random matrix $Y = \sum_k X_k$. By applying the expectation bound (6.1.2) to $-Y$, we obtain

$$\mathbb{E} \lambda_{\min}(Y) \ge -\sqrt{2 \sigma^2 \log d} - \frac{1}{3} \underline{R} \log d, \tag{6.1.5}$$

where $\sigma^2 = \sigma^2(Y)$. We can use (6.1.3) to develop a tail bound. For $t \ge 0$,

$$\mathbb{P}\{ \lambda_{\min}(Y) \le -t \} \le d \cdot \exp\left( \frac{-t^2/2}{\sigma^2 + \underline{R}\, t/3} \right).$$

Let us emphasize that the bounds for $\lambda_{\max}(Y)$ and the bounds for $\lambda_{\min}(Y)$ may diverge because
the two parameters $R$ and $\underline{R}$ can take sharply different values.

Finally, it is important to recognize that the matrix Bernstein inequality applies just as well
to uncentered matrices. Consider a finite sequence $\{X_k\}$ of independent, random Hermitian
matrices with dimension $d$. Assume that each matrix satisfies the bound

$$\lambda_{\max}(X_k - \mathbb{E} X_k) \le R.$$

Introduce the sum $Y = \sum_k X_k$, and compute the variance parameter

$$\sigma^2 = \sigma^2(Y) = \| \mathbb{E}((Y - \mathbb{E} Y)^2) \| = \Big\| \sum_k \mathbb{E}((X_k - \mathbb{E} X_k)^2) \Big\|.$$

Then we have the expectation bound

$$\mathbb{E} \lambda_{\max}(Y - \mathbb{E} Y) \le \sqrt{2 \sigma^2 \log d} + \frac{1}{3} R \log d.$$

Furthermore, for $t \ge 0$,

$$\mathbb{P}\{ \lambda_{\max}(Y - \mathbb{E} Y) \ge t \} \le d \cdot \exp\left( \frac{-t^2/2}{\sigma^2 + Rt/3} \right).$$

Similar results hold for $\lambda_{\min}(Y)$, as discussed in the previous paragraph.

There  are  many  other  types  of  matrix  Bernstein  inequalities.  For  example,  we  can  sharpen 
the  tail  bound  (6.1.3)  to  obtain  a  matrix  Bennett  inequality.  We  can  also  relax  the  boundedness 
assumption to a weaker hypothesis on the growth of the moments of each summand $X_k$. See the
notes  at  the  end  of  this  chapter  and  the  annotated  bibliography  for  more  information. 
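As a sanity check on the uncentered form of the Hermitian Bernstein inequality discussed above, one can compare the expectation bound with a small simulation. The sketch below is only an illustration under assumed data: the summands are $X_k = \delta_k A_k$ with $\delta_k \sim$ Bernoulli($q$) and fixed rank-one projectors $A_k$; the dimensions, the rate `q`, and the number of trials are arbitrary choices that do not come from the notes.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, q = 8, 40, 0.3   # dimension, number of summands, Bernoulli rate (illustrative)

# Fixed rank-one projectors A_k = a_k a_k^T built from unit vectors a_k.
a = rng.standard_normal((n, d))
a /= np.linalg.norm(a, axis=1, keepdims=True)
A = np.einsum("ki,kj->kij", a, a)

# X_k = delta_k * A_k, so E X_k = q * A_k is not zero.
# The centered summand (delta_k - q) A_k has maximum eigenvalue at most 1 - q.
R = 1.0 - q
# E (X_k - E X_k)^2 = q (1 - q) A_k^2 = q (1 - q) A_k because A_k is a projector.
sigma2 = np.linalg.norm(q * (1.0 - q) * A.sum(axis=0), 2)
bound = np.sqrt(2.0 * sigma2 * np.log(d)) + R * np.log(d) / 3.0

def sample_lambda_max() -> float:
    """Draw one copy of Y - E Y and return its largest eigenvalue."""
    delta = rng.random(n) < q
    Y_centered = np.einsum("k,kij->ij", delta - q, A)
    return float(np.linalg.eigvalsh(Y_centered)[-1])

empirical = float(np.mean([sample_lambda_max() for _ in range(2000)]))
print(f"empirical E lambda_max = {empirical:.3f}, Bernstein bound = {bound:.3f}")
```

The empirical average should sit below the bound; the gap gives a rough sense of how much is lost to the dimensional factor in a particular instance.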


6.2  A  Sum  of  Bounded  Rectangular  Matrices 

The  matrix  Bernstein  inequality  admits  an  extension  to  a  sum  of  general  random  matrices  that 
are  subject  to  a  uniform  norm  bound.  This  result  turns  out  to  be  a  formal  consequence  of  the 
Hermitian  result,  Theorem  6.1.1,  even  though  it  may  initially  seem  more  powerful. 

Corollary 6.2.1 (Matrix Bernstein: General Case). Consider a finite sequence $\{S_k\}$ of independent,
random matrices with dimension $d_1 \times d_2$. Assume that

$$\mathbb{E} S_k = 0 \quad\text{and}\quad \|S_k\| \le R.$$

Introduce the random matrix

$$Z = \sum_k S_k.$$

Compute the variance parameter

$$\sigma^2 = \sigma^2(Z) = \max\{ \|\mathbb{E}(ZZ^*)\|, \ \|\mathbb{E}(Z^*Z)\| \}. \tag{6.2.1}$$

Then

$$\mathbb{E}\|Z\| \le \sqrt{2 \sigma^2 \log(d_1 + d_2)} + \frac{1}{3} R \log(d_1 + d_2). \tag{6.2.2}$$

Furthermore, for all $t \ge 0$,

$$\mathbb{P}\{ \|Z\| \ge t \} \le (d_1 + d_2) \cdot \exp\left( \frac{-t^2/2}{\sigma^2 + Rt/3} \right).$$

The  proof  of  Corollary  6.2.1  appears  in  §6.5. 
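The variance parameter for rectangular matrices involves two second moments, one for each side. The sketch below illustrates the computation for an assumed example, a Rademacher series $Z = \sum_k \varepsilon_k A_k$ with fixed coefficient matrices $A_k$, so that $\mathbb{E}(S_k S_k^*) = A_k A_k^*$ and $\mathbb{E}(S_k^* S_k) = A_k^* A_k$. The coefficient matrices, dimensions, and trial count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
d1, d2, n = 6, 10, 30
A = rng.standard_normal((n, d1, d2)) / np.sqrt(n)   # fixed coefficients (illustrative)

# S_k = eps_k * A_k with independent Rademacher signs: E S_k = 0 and ||S_k|| = ||A_k||.
R = max(np.linalg.norm(A_k, 2) for A_k in A)
left = sum(A_k @ A_k.conj().T for A_k in A)     # sum_k E(S_k S_k^*), size d1 x d1
right = sum(A_k.conj().T @ A_k for A_k in A)    # sum_k E(S_k^* S_k), size d2 x d2
sigma2 = max(np.linalg.norm(left, 2), np.linalg.norm(right, 2))

log_dim = np.log(d1 + d2)
bound = np.sqrt(2.0 * sigma2 * log_dim) + R * log_dim / 3.0

def sample_norm() -> float:
    """Draw one copy of Z and return its spectral norm."""
    eps = rng.choice([-1.0, 1.0], size=n)
    Z = np.einsum("k,kij->ij", eps, A)
    return float(np.linalg.norm(Z, 2))

empirical = float(np.mean([sample_norm() for _ in range(2000)]))
print(f"empirical E||Z|| = {empirical:.3f}, expectation bound = {bound:.3f}")
```

Note that the two candidate variances generally differ when $d_1 \ne d_2$, which is why the maximum appears in the definition.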

6.2.1  Discussion 

The general case is similar to the Hermitian case, Theorem 6.1.1, in many respects. Corollary 6.2.1
also has a lot in common with Corollary 4.2.1, concerning a Gaussian series with general
matrix coefficients. As a consequence, we do not indulge in an extensive commentary.

First, let us express the variance parameter (6.2.1) in terms of the summands:

$$\begin{aligned}
\sigma^2(Z) &= \max\{ \|\mathbb{E}(ZZ^*)\|, \ \|\mathbb{E}(Z^*Z)\| \} \\
&= \max\Big\{ \Big\| \sum_{j,k} \mathbb{E}(S_j S_k^*) \Big\|, \ \Big\| \sum_{j,k} \mathbb{E}(S_j^* S_k) \Big\| \Big\} \\
&= \max\Big\{ \Big\| \sum_k \mathbb{E}(S_k S_k^*) \Big\|, \ \Big\| \sum_k \mathbb{E}(S_k^* S_k) \Big\| \Big\}.
\end{aligned}$$

As usual, the last relation holds because the summands are independent, zero-mean random
matrices.

The version of Corollary 6.2.1 for uncentered matrices is important enough that we lay out
the details. Consider a finite sequence $\{S_k\}$ of independent, random matrices with dimension
$d_1 \times d_2$. Assume that each matrix satisfies the bound

$$\|S_k - \mathbb{E} S_k\| \le R.$$

Introduce the sum $Z = \sum_k S_k$, and compute the variance parameter

$$\begin{aligned}
\sigma^2 &= \max\{ \|\mathbb{E}((Z - \mathbb{E} Z)(Z - \mathbb{E} Z)^*)\|, \ \|\mathbb{E}((Z - \mathbb{E} Z)^*(Z - \mathbb{E} Z))\| \} \\
&= \max\Big\{ \Big\| \sum_k \mathbb{E}((S_k - \mathbb{E} S_k)(S_k - \mathbb{E} S_k)^*) \Big\|, \ \Big\| \sum_k \mathbb{E}((S_k - \mathbb{E} S_k)^*(S_k - \mathbb{E} S_k)) \Big\| \Big\}.
\end{aligned}$$

Then we have the expectation bound

$$\mathbb{E}\|Z - \mathbb{E} Z\| \le \sqrt{2 \sigma^2 \log(d_1 + d_2)} + \frac{1}{3} R \log(d_1 + d_2).$$

Furthermore, for $t \ge 0$,

$$\mathbb{P}\{ \|Z - \mathbb{E} Z\| \ge t \} \le (d_1 + d_2) \cdot \exp\left( \frac{-t^2/2}{\sigma^2 + Rt/3} \right).$$

The results in this paragraph are probably the most commonly used versions of the matrix
Bernstein bounds.

6.3  Application:  Randomized  Sparsification  of  a  Matrix 

Many  tasks  in  data  analysis  require  spectral  computations  on  large,  dense  matrices.  Yet  many 
spectral  decomposition  algorithms  operate  most  efficiently  on  sparse  matrices.  If  we  can  tolerate 
approximate  results,  we  may  be  able  to  reduce  the  computational  cost  by  replacing  the  original 
dense  matrix  with  a  sparse  proxy  that  has  a  similar  spectrum.  An  elegant  way  to  identify  the 
sparse  proxy  is  to  randomly  zero  out  entries  of  the  original  matrix.  In  this  example,  we  examine 
the  performance  of  one  such  approach. 

Let $B$ be a fixed $d_1 \times d_2$ complex matrix. Write $L = \max_{jk} |b_{jk}|$ for the maximum absolute
entry of the matrix. Fix a sparsification parameter $p \in (0, 1)$, and consider a family of independent
Bernoulli random variables:

$$\delta_{jk} \sim \text{BERNOULLI}(p_{jk}) \quad\text{where}\quad p_{jk} = \frac{p\,|b_{jk}|}{p\,|b_{jk}| + L}.$$

It is easy to verify that $0 \le p_{jk} < 1$, so this is a legitimate probability. Draw a random, sparse
matrix $Z$ with entries

$$z_{jk} = \delta_{jk} \cdot \frac{b_{jk}}{p_{jk}} \quad\text{for } j = 1, \dots, d_1 \text{ and } k = 1, \dots, d_2.$$


In other words, we zero out small entries with high probability, and we zero out larger entries
with low probability. We rescale the entries that we keep to compensate. Observe that
$\mathbb{E} z_{jk} = b_{jk}$, so that the random sparse matrix $Z$ is an unbiased approximation to the
original matrix $B$.

We must assess the typical sparsity of the random matrix $Z$, and we must bound the distance
between $Z$ and the original matrix $B$ in the spectral norm. An elementary calculation shows that
the expected number of nonzero entries in $Z$ is at most

$$\sum_{j,k} p_{jk} = \sum_{j,k} \frac{p\,|b_{jk}|}{p\,|b_{jk}| + L} \le p \cdot d_1 d_2.$$


So the parameter $p$ is a bound for the proportion of nonzero entries appearing in the reduced
matrix. We will show that the expected approximation error satisfies

$$\frac{\mathbb{E}\|Z - B\|}{\|B\|} \le \sqrt{ \frac{2 L^2 \max\{d_1, d_2\} \log(d_1 + d_2)}{p \, \|B\|^2} } + \frac{2 L \log(d_1 + d_2)}{3 p \, \|B\|}. \tag{6.3.1}$$

Ignoring the logarithmic factors, we learn that it is possible to construct a sparse matrix that
approximates $B$ with a small relative error, provided that

$$L \ll \|B\| \quad\text{and}\quad L^2 \max\{d_1, d_2\} \ll \|B\|^2.$$

Matrices whose largest entries are relatively small as compared with the norm are natural
candidates for this type of processing.
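The sparsification procedure described above can be sketched in a few lines. The code below assumes the keep-probabilities $p_{jk} = p|b_{jk}| / (p|b_{jk}| + L)$ and compares one realization of the relative error with a bound on its expectation computed from the estimates $\sigma^2 \le (L^2/p) \max\{d_1, d_2\}$ and $R = 2L/p$. The function name `sparsify`, the test matrix, and the convention that zero entries of $B$ stay zero are assumptions made for illustration.

```python
import numpy as np

def sparsify(B: np.ndarray, p: float, rng: np.random.Generator):
    """Return (Z, P): a random sparse proxy for B and the keep-probabilities p_jk."""
    absB = np.abs(B)
    L = absB.max()                         # largest absolute entry (assumed nonzero)
    P = p * absB / (p * absB + L)          # p_jk = p|b_jk| / (p|b_jk| + L)
    keep = rng.random(B.shape) < P         # delta_jk ~ Bernoulli(p_jk); zero entries are never kept
    Z = np.zeros(B.shape, dtype=np.result_type(B, np.float64))
    Z[keep] = B[keep] / P[keep]            # rescale the surviving entries by 1 / p_jk
    return Z, P

rng = np.random.default_rng(0)
d1, d2, p = 300, 200, 0.25
B = 1.0 + 0.1 * rng.standard_normal((d1, d2))   # entries of comparable size (illustrative)
Z, P = sparsify(B, p, rng)

L = np.abs(B).max()
norm_B = np.linalg.norm(B, 2)
log_dim = np.log(d1 + d2)
# Bound on the expected relative error with sigma^2 <= (L^2 / p) max(d1, d2) and R = 2 L / p.
rel_bound = (np.sqrt(2 * L**2 * max(d1, d2) * log_dim / p) + 2 * L * log_dim / (3 * p)) / norm_B
rel_error = np.linalg.norm(Z - B, 2) / norm_B

print(f"kept {np.count_nonzero(Z)} of {B.size} entries (expected at most {p * B.size:.0f})")
print(f"relative error {rel_error:.3f}; bound on its expectation {rel_bound:.3f}")
```

The test matrix has entries of roughly equal magnitude, which is the regime where the largest entry is small compared with the spectral norm and the method is expected to work well.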


The  Analysis 

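We will use the matrix Bernstein inequality to study how well the sparsification procedure
preserves the spectral properties of the original matrix. For reference, we calculate the mean and
variance of the entries of $Z$:

$$\mathbb{E} z_{jk} = b_{jk} \quad\text{and}\quad \mathbb{E}|z_{jk} - b_{jk}|^2 = |b_{jk}|^2 \cdot \frac{1 - p_{jk}}{p_{jk}} = \frac{L\,|b_{jk}|}{p} \le \frac{L^2}{p}.$$

It follows that $\mathbb{E} Z = B$. Define the error matrix $E = Z - B$, and write

$$E = \sum_{j,k} (z_{jk} - b_{jk}) \, E_{jk} = \sum_{j,k} S_{jk},$$

where $E_{jk}$ is the matrix with a one in position $(j, k)$ and zeros elsewhere, and the expression
above defines the summands $S_{jk}$. It is immediate that each summand
satisfies $\mathbb{E} S_{jk} = 0$ and that $\{S_{jk}\}$ is an independent family.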

To apply the matrix Bernstein inequality, we first observe that the summands satisfy a uniform
bound:

$$\|S_{jk}\| = |z_{jk} - b_{jk}| \le \frac{|b_{jk}|}{p_{jk}} = \frac{p\,|b_{jk}| + L}{p} \le \frac{2L}{p}.$$

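Determining the variance of the error matrix $E$ takes a little more work. We have

$$\sum_{j,k} \mathbb{E}(S_{jk} S_{jk}^*) = \sum_{j,k} \big[ \mathbb{E}|z_{jk} - b_{jk}|^2 \big] \, E_{jj} \preccurlyeq \frac{d_2 L^2}{p} \cdot I_{d_1}.$$

It follows that

$$\Big\| \sum_{j,k} \mathbb{E}(S_{jk} S_{jk}^*) \Big\| \le \frac{d_2 L^2}{p}.$$

Similarly,

$$\Big\| \sum_{j,k} \mathbb{E}(S_{jk}^* S_{jk}) \Big\| \le \frac{d_1 L^2}{p}.$$

We conclude that the variance (6.2.1) of the error matrix satisfies

$$\sigma^2(E) \le \frac{L^2}{p} \max\{d_1, d_2\}.$$

The expectation bound (6.2.2) from Corollary 6.2.1 delivers

$$\mathbb{E}\|E\| \le \sqrt{ \frac{2 L^2 \max\{d_1, d_2\} \log(d_1 + d_2)}{p} } + \frac{2 L \log(d_1 + d_2)}{3p}.$$

The result (6.3.1) is a direct consequence of this inequality.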


6.4  Application:  Randomized  Matrix  Multiplication 


Over  the  last  decade,  randomized  algorithms  have  started  to  play  an  important  role  in  numerical 
linear  algebra.  One  of  the  basic  tasks  in  linear  algebra  is  to  multiply  two  matrices  with  compatible 
dimensions. Suppose that $B$ is a $d_1 \times N$ complex matrix and that $C$ is an $N \times d_2$ complex matrix,
and we wish to compute the product $BC$. The straightforward algorithm forms the product entry
by entry:

$$(BC)_{ik} = \sum_{j=1}^{N} b_{ij} c_{jk} \quad\text{for each } i = 1, \dots, d_1 \text{ and } k = 1, \dots, d_2. \tag{6.4.1}$$

This approach takes $O(N d_1 d_2)$ arithmetic operations. There are algorithms, such as Strassen’s
divide-and-conquer method, that can reduce the cost, but these approaches are not considered
practical for most applications.

In certain circumstances, we can accelerate matrix multiplication using randomized methods.
The key to this approach is to view the matrix product as a sum of outer products:

$$BC = \sum_{k=1}^{N} b_{:k} c_{k:}. \tag{6.4.2}$$

Next, we reinterpret this sum as the expectation of a random matrix. It takes some care to do this
properly. Define a set of probabilities

$$p_j = \frac{\|b_{:j}\| \, \|c_{j:}\|}{\sum_{k=1}^{N} \|b_{:k}\| \, \|c_{k:}\|} \quad\text{for } j = 1, 2, \dots, N.$$

Now, we introduce a random matrix $R$ with distribution

$$R = \frac{1}{p_j} b_{:j} c_{j:} \quad\text{with probability } p_j \text{ for each } j = 1, 2, \dots, N.$$

It follows that

$$\mathbb{E} R = \sum_{j=1}^{N} b_{:j} c_{j:} = BC.$$

Therefore, we can regard $R$ as a randomized proxy for the product $BC$. This estimator is unbiased,
but the variance may be intolerable. To obtain a more precise estimate for the product, we
can average $n$ independent copies of $R$:

$$R_n = \frac{1}{n} \sum_{k=1}^{n} R_k.$$


We  must  assess  how  large  n  must  be  for  Rn  to  achieve  a  reasonable  error,  and  we  must  bound 
the  computational  cost  of  the  resulting  estimator. 
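The sampling scheme above translates directly into code. The following is a minimal sketch, assuming that $b_{:j}$ denotes the $j$th column of $B$, that $c_{j:}$ denotes the $j$th row of $C$, and that the sampling probabilities are proportional to $\|b_{:j}\| \|c_{j:}\|$. The function names and the test matrices are illustrative assumptions.

```python
import numpy as np

def sampling_probabilities(B: np.ndarray, C: np.ndarray) -> np.ndarray:
    """p_j proportional to ||b_{:j}|| * ||c_{j:}|| (column of B times row of C)."""
    weights = np.linalg.norm(B, axis=0) * np.linalg.norm(C, axis=1)
    return weights / weights.sum()

def randomized_product(B: np.ndarray, C: np.ndarray, n: int, rng: np.random.Generator) -> np.ndarray:
    """Average n independent copies of R = b_{:j} c_{j:} / p_j."""
    p = sampling_probabilities(B, C)
    idx = rng.choice(B.shape[1], size=n, p=p)
    # Scale each sampled column of B by 1 / (n p_j), then multiply by the sampled rows of C.
    return (B[:, idx] / (n * p[idx])) @ C[idx, :]

rng = np.random.default_rng(0)
B = rng.standard_normal((20, 5000))
C = rng.standard_normal((5000, 30))
exact = B @ C
scale = np.linalg.norm(B, 2) * np.linalg.norm(C, 2)
for n in (50, 500, 5000):
    err = np.linalg.norm(randomized_product(B, C, n, rng) - exact, 2)
    print(f"n = {n:5d}: relative error {err / scale:.3f}")
```

Forming the estimate as a single product of an $d_1 \times n$ matrix with an $n \times d_2$ matrix avoids building the $n$ outer products one at a time.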

It  is  helpful  to  frame  the  results  in  terms  of  the  stable  rank  of  the  matrices  B  and  C  that  appear 
in  the  product. 


Definition 6.4.1 (Stable Rank). The stable rank of a matrix $F$ is defined as

$$\operatorname{srank}(F) = \frac{\|F\|_{\mathrm{F}}^2}{\|F\|^2}.$$

The Frobenius norm is defined by the relation $\|F\|_{\mathrm{F}}^2 = \operatorname{tr}(FF^*)$.


The stable rank is a lower bound for the algebraic rank: $1 \le \operatorname{srank}(F) \le \operatorname{rank}(F)$. Check these
inequalities  by  expressing  the  two  norms  in  terms  of  the  singular  values  of  F.  In  contrast  with 
the  algebraic  rank,  the  stable  rank  is  a  continuous  function  of  the  matrix,  so  it  is  more  suitable 
for  numerical  applications. 
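The stable rank is easy to compute from the singular values, which also makes the two inequalities above transparent. The sketch below is illustrative; the function name `stable_rank` and the example matrices are assumptions, not part of the notes.

```python
import numpy as np

def stable_rank(F: np.ndarray) -> float:
    """srank(F) = ||F||_F^2 / ||F||^2, computed from the singular values of F."""
    s = np.linalg.svd(F, compute_uv=False)   # singular values in decreasing order
    return float(np.sum(s**2) / s[0]**2)

rng = np.random.default_rng(0)
u, v = rng.standard_normal(6), rng.standard_normal(4)
rank_one = np.outer(u, v)
perturbed = rank_one + 1e-3 * rng.standard_normal((6, 4))

print(stable_rank(np.eye(4)))        # identity: stable rank equals the dimension
print(stable_rank(rank_one))         # rank-one matrix: stable rank one
print(stable_rank(perturbed), np.linalg.matrix_rank(perturbed))
```

The last line illustrates the continuity remark: a tiny perturbation raises the algebraic rank to its maximum, while the stable rank barely moves.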

We are prepared to present the main claim about randomized matrix multiplication. Fix a
parameter $\varepsilon \in (0, 1]$. Suppose that the number $n$ of samples satisfies

$$n \ge \frac{5 \cdot \operatorname{srank}(B) \cdot \operatorname{srank}(C) \cdot \log(d_1 + d_2)}{\varepsilon^2}. \tag{6.4.3}$$

Then the randomized estimate $R_n$ for the product achieves a relative error of $\varepsilon$ in the spectral
norm:

$$\mathbb{E}\|R_n - BC\| \le \varepsilon \, \|B\| \, \|C\|. \tag{6.4.4}$$

To compute $R_n$, we need $O(n d_1 d_2)$ arithmetic operations. Therefore, the estimator is efficient
when the number $n$ of samples is much smaller than $N$, the inner dimension of the product $BC$.
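To see what the sample-size condition means in practice, one can evaluate it for given stable ranks. The helper below is a sketch that assumes the condition reads $n \ge 5 \cdot \operatorname{srank}(B) \cdot \operatorname{srank}(C) \cdot \log(d_1 + d_2) / \varepsilon^2$, with the stable ranks of the two factors of the product; the function name and the example values are hypothetical.

```python
import math

def samples_needed(srank_B: float, srank_C: float, d1: int, d2: int, eps: float) -> int:
    """Smallest integer n satisfying n >= 5 srank(B) srank(C) log(d1 + d2) / eps^2."""
    if not 0.0 < eps <= 1.0:
        raise ValueError("eps must lie in (0, 1]")
    return math.ceil(5.0 * srank_B * srank_C * math.log(d1 + d2) / eps**2)

# Illustrative values: low stable ranks, moderate outer dimensions.
for eps in (0.5, 0.1):
    print(eps, samples_needed(2.0, 3.0, d1=10, d2=10, eps=eps))
```

The count does not depend on the inner dimension $N$, so the comparison with $N$ decides whether sampling is worthwhile.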

This result is natural because a matrix with low stable rank contains a lot of redundant
information. As a consequence, we do not need to multiply each column of $B$ with each row of $C$
to  get  a  good  estimate  for  the  product.  In  particular,  when  the  outer  dimensions  d\  and  d2  are 
much  smaller  than  the  inner  dimension  N,  many  of  the  terms  in  (6.4.2)  can  be  omitted  without 
a  significant  loss. 

Remark  6.4.2.  Since  our  goal  is  to  illustrate  the  analysis  of  a  random  matrix,  the  algorithmic 
details  are  not  especially  important.  Nevertheless,  we  should  point  out  that  the  method  we  have 
described  is  not  the  most  effective  way  to  perform  randomized  matrix  multiplication.  It  is  better 
to  apply  a  preprocessing  step  to  ensure  that  the  columns  of  B  have  comparable  norms  and  that 
the  rows  of  C  have  comparable  norms.  In  this  case,  it  is  possible  to  obtain  a  somewhat  better 
bound on the number of samples required.


The Analysis

To study the behavior of randomized matrix multiplication, we introduce the error matrix

$$E = R_n - BC = \frac{1}{n} \sum_{k=1}^{n} (R_k - BC) = \sum_{k=1}^{n} S_k.$$

The random matrices $S_k$ are defined by the previous expression. Observe that the summands are
independent, and each has zero mean. Therefore, we can apply the matrix Bernstein inequality
to study the expected norm of the error.

First, let us bound the norm of a generic summand $S = n^{-1}(R - BC)$. Note that

$$\|R\| \le \max_j \frac{1}{p_j} \|b_{:j} c_{j:}\| = \sum_{k=1}^{N} \|b_{:k}\| \, \|c_{k:}\| \le \|B\|_{\mathrm{F}} \, \|C\|_{\mathrm{F}}.$$

The last inequality is Cauchy-Schwarz. Therefore, we have the uniform bound

$$\|S\| = \frac{1}{n} \|R - BC\| \le \frac{1}{n} \big( \|R\| + \|B\| \, \|C\| \big) \le \frac{2}{n} \|B\|_{\mathrm{F}} \, \|C\|_{\mathrm{F}}.$$

Observe that the bound decreases with the number $n$ of samples.

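Next, we compute the variance of $E$. This takes some effort. First, consider a generic summand
$S$. Form the expectation

$$\mathbb{E}(SS^*) = \frac{1}{n^2} \mathbb{E}\big[ (R - BC)(R - BC)^* \big] = \frac{1}{n^2} \big[ \mathbb{E}(RR^*) - BCC^*B^* \big].$$

Let us focus on the first term on the right-hand side.

$$\|\mathbb{E}(RR^*)\| = \Big\| \sum_{j=1}^{N} \frac{1}{p_j} (b_{:j} c_{j:})(b_{:j} c_{j:})^* \Big\| \le \sum_{j=1}^{N} \frac{1}{p_j} \|b_{:j}\|^2 \|c_{j:}\|^2 = \Big( \sum_{j=1}^{N} \|b_{:j}\| \, \|c_{j:}\| \Big)^2 \le \|B\|_{\mathrm{F}}^2 \, \|C\|_{\mathrm{F}}^2.$$

In combination, the last two displays yield

$$\|\mathbb{E}(SS^*)\| \le \frac{1}{n^2} \big[ \|B\|_{\mathrm{F}}^2 \|C\|_{\mathrm{F}}^2 + \|BCC^*B^*\| \big] \le \frac{2}{n^2} \|B\|_{\mathrm{F}}^2 \, \|C\|_{\mathrm{F}}^2.$$

To obtain the variance of the error matrix $E$, we calculate that

$$\|\mathbb{E}(EE^*)\| = \Big\| \sum_{j,k=1}^{n} \mathbb{E}(S_j S_k^*) \Big\| = \Big\| \sum_{k=1}^{n} \mathbb{E}(S_k S_k^*) \Big\| \le \frac{2}{n} \|B\|_{\mathrm{F}}^2 \, \|C\|_{\mathrm{F}}^2.$$

The second identity holds because the summands are independent and zero mean. The last
bound follows from the triangle inequality and the calculation for a generic summand. The second
component of the variance does not require any additional ideas, and we reach the bound

$$\sigma^2(E) = \max\{ \|\mathbb{E}(EE^*)\|, \ \|\mathbb{E}(E^*E)\| \} \le \frac{2}{n} \|B\|_{\mathrm{F}}^2 \, \|C\|_{\mathrm{F}}^2.$$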

Observe that we retain the favorable dependence on the number $n$ of samples.

We have acquired what we need to apply the matrix Bernstein inequality. Invoke the expectation
bound (6.2.2) to reach

$$\mathbb{E}\|E\| \le \sqrt{ \frac{4 \log(d_1 + d_2)}{n} } \, \|B\|_{\mathrm{F}} \|C\|_{\mathrm{F}} + \frac{2 \log(d_1 + d_2)}{3n} \, \|B\|_{\mathrm{F}} \|C\|_{\mathrm{F}}.$$

With our choice of $n$ from (6.4.3), we conclude that

$$\mathbb{E}\|E\| \le \frac{2\varepsilon}{\sqrt{5}} \, \|B\| \, \|C\| + \frac{2\varepsilon^2}{15} \cdot \frac{\|B\|^2 \, \|C\|^2}{\|B\|_{\mathrm{F}} \, \|C\|_{\mathrm{F}}} \le \varepsilon \, \|B\| \, \|C\|.$$

The last bound holds because the Frobenius norm dominates the spectral norm. This is the
result (6.4.4).


6.5  Proof  of  the  Matrix  Bernstein  Inequalities 

In  establishing  the  matrix  Bernstein  inequality,  the  main  challenge  is  to  obtain  an  appropriate 
bound  for  the  matrix  mgf  and  cgf  of  a  zero-mean  random  matrix  whose  eigenvalues  satisfy  a 
uniform  bound.  We  do  not  present  the  sharpest  estimate  possible,  but  rather  the  one  that  leads 
most  directly  to  the  useful  results  stated  in  Theorem  6.1.1. 

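Lemma 6.5.1 (Matrix Bernstein: Mgf and Cgf Bound). Suppose that $X$ is a random Hermitian
matrix that satisfies

$$\mathbb{E} X = 0 \quad\text{and}\quad \lambda_{\max}(X) \le R.$$

Then, for $0 < \theta < 3/R$,

$$\mathbb{E} e^{\theta X} \preccurlyeq \exp\left( \frac{\theta^2/2}{1 - R\theta/3} \cdot \mathbb{E}(X^2) \right) \quad\text{and}\quad \log \mathbb{E} e^{\theta X} \preccurlyeq \frac{\theta^2/2}{1 - R\theta/3} \cdot \mathbb{E}(X^2).$$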

Proof. Fix the parameter $\theta > 0$. In the exponential $e^{\theta X}$, we would like to expose the random
matrix $X$ and its square $X^2$ so that we can exploit information about the mean and variance. To
that end, we write

$$e^{\theta X} = I + \theta X + (e^{\theta X} - \theta X - I) = I + \theta X + X \cdot f(X) \cdot X, \tag{6.5.1}$$

where $f$ is a function on the real line:

$$f(x) = \frac{e^{\theta x} - \theta x - 1}{x^2} \quad\text{for } x \ne 0 \quad\text{and}\quad f(0) = \frac{\theta^2}{2}.$$

The function $f$ is increasing because its derivative is positive. Therefore, $f(x) \le f(R)$ when $x \le R$.
By assumption, the eigenvalues of $X$ do not exceed $R$, so the Transfer Rule (2.1.6) implies that

$$f(X) \preccurlyeq f(R) \cdot I. \tag{6.5.2}$$

The Conjugation Rule (2.1.4) allows us to introduce the relation (6.5.2) into our expansion (6.5.1)
of the matrix exponential:

$$e^{\theta X} \preccurlyeq I + \theta X + X (f(R) \cdot I) X = I + \theta X + f(R) \cdot X^2.$$

Take the expectation of this semidefinite bound to reach

$$\mathbb{E} e^{\theta X} \preccurlyeq I + f(R) \cdot \mathbb{E}(X^2). \tag{6.5.3}$$

The expression (6.5.3) provides a powerful bound for the matrix mgf. In fact, this result leads to
the matrix Bennett inequality, which strengthens Theorem 6.1.1. We have chosen to present the
weaker result because it is easier to apply in practice. To arrive at the mgf bound required for
Theorem 6.1.1, we must keep working.

We need an inequality for the quantity $f(R)$. This argument involves a clever application of
Taylor series:

$$f(R) = \frac{e^{R\theta} - R\theta - 1}{R^2} = \frac{1}{R^2} \sum_{q=2}^{\infty} \frac{(R\theta)^q}{q!} \le \frac{\theta^2}{2} \sum_{q=2}^{\infty} \frac{(R\theta)^{q-2}}{3^{q-2}} = \frac{\theta^2/2}{1 - R\theta/3}. \tag{6.5.4}$$

The second expression is simply the Taylor expansion of the fraction, viewed as a function of
$\theta$. We obtain the inequality by factoring out $(R\theta)^2/2$ from each term in the series and invoking
the bound $q! \ge 2 \cdot 3^{q-2}$, valid for each $q = 2, 3, 4, \dots$. Sum the geometric series to obtain the final
identity.

Introduce the inequality (6.5.4) for $f(R)$ into the semidefinite bound (6.5.3) for the matrix
mgf to reach

$$\mathbb{E} e^{\theta X} \preccurlyeq I + \frac{\theta^2/2}{1 - R\theta/3} \cdot \mathbb{E}(X^2) \preccurlyeq \exp\left( \frac{\theta^2/2}{1 - R\theta/3} \cdot \mathbb{E}(X^2) \right).$$

The second semidefinite relation follows when we apply the Transfer Rule (2.1.6) to the inequality
$1 + a \le e^a$, which holds for $a \in \mathbb{R}$.

To obtain the semidefinite bound for the cgf, we extract the logarithm of the mgf bound using
the fact (2.1.9) that the logarithm is operator monotone. □
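The two scalar facts used in this proof, that $f$ is increasing and that $f(R)$ is dominated by $(\theta^2/2)/(1 - R\theta/3)$ for $0 < \theta < 3/R$, are easy to check numerically. The sketch below does so on a grid; the function names, the value of `R`, and the grid are illustrative assumptions.

```python
import math

def f(x: float, theta: float) -> float:
    """f(x) = (exp(theta x) - theta x - 1) / x^2, with f(0) = theta^2 / 2."""
    if x == 0.0:
        return theta * theta / 2.0
    return (math.expm1(theta * x) - theta * x) / (x * x)

def bernstein_majorant(R: float, theta: float) -> float:
    """(theta^2 / 2) / (1 - R theta / 3), defined for 0 < theta < 3 / R."""
    return (theta * theta / 2.0) / (1.0 - R * theta / 3.0)

R = 2.0
thetas = [k * (3.0 / R) / 50 for k in range(1, 50)]   # grid inside the interval (0, 3/R)
worst_gap = min(bernstein_majorant(R, th) - f(R, th) for th in thetas)
print(f"smallest value of majorant - f(R) on the grid: {worst_gap:.3e}")

# Monotonicity of f in x for a fixed theta (here theta = 1).
xs = [-3.0, -1.0, 0.0, 1.0, 2.0]
print([round(f(x, 1.0), 4) for x in xs])
```

The gap shrinks as $\theta \to 0$, where both expressions behave like $\theta^2/2$, and it blows up as $\theta \to 3/R$.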


We are prepared to establish the matrix Bernstein inequalities for random Hermitian matrices.


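Proof of Theorem 6.1.1. Consider a finite sequence $\{X_k\}$ of random Hermitian matrices with
dimension $d$. Assume that

$$\mathbb{E} X_k = 0 \quad\text{and}\quad \lambda_{\max}(X_k) \le R.$$

The matrix Bernstein cgf bound, Lemma 6.5.1, provides that

$$\log \mathbb{E} e^{\theta X_k} \preccurlyeq g(\theta) \cdot \mathbb{E}(X_k^2) \quad\text{where}\quad g(\theta) = \frac{\theta^2/2}{1 - R\theta/3} \quad\text{for } 0 < \theta < 3/R. \tag{6.5.5}$$

Define the sum $Y = \sum_k X_k$, which it is our task to analyze.

We begin with the bound (6.1.2) for the expectation $\mathbb{E} \lambda_{\max}(Y)$. Invoke the master inequality,
relation (3.6.1) in Theorem 3.6.1, to find that

$$\begin{aligned}
\mathbb{E} \lambda_{\max}(Y) &\le \inf_{\theta > 0} \frac{1}{\theta} \log \operatorname{tr} \exp\Big( \sum_k \log \mathbb{E} e^{\theta X_k} \Big) \\
&\le \inf_{0 < \theta < 3/R} \frac{1}{\theta} \log \operatorname{tr} \exp\big( g(\theta) \cdot \mathbb{E}(Y^2) \big).
\end{aligned}$$

As usual, to move from the first to the second line, we invoke the fact (2.1.7) that the trace
exponential is monotone to introduce the semidefinite bound (6.5.5) for the cgf. The rest of the
argument glides along a well-oiled track:

$$\begin{aligned}
\mathbb{E} \lambda_{\max}(Y) &\le \inf_{0 < \theta < 3/R} \frac{1}{\theta} \log\big[ d \, \lambda_{\max}\big( \exp( g(\theta) \cdot \mathbb{E}(Y^2) ) \big) \big] \\
&= \inf_{0 < \theta < 3/R} \frac{1}{\theta} \log\big[ d \exp\big( g(\theta) \cdot \lambda_{\max}( \mathbb{E}(Y^2) ) \big) \big] \\
&= \inf_{0 < \theta < 3/R} \frac{1}{\theta} \log\big[ d \exp\big( g(\theta) \cdot \sigma^2 \big) \big] \\
&= \inf_{0 < \theta < 3/R} \left[ \frac{\log d}{\theta} + \frac{\theta/2}{1 - R\theta/3} \cdot \sigma^2 \right].
\end{aligned}$$

In the first inequality, we bound the trace of the exponential by the dimension $d$ times the
maximum eigenvalue. The next line follows from the Spectral Mapping Theorem, Proposition 2.1.3.
In the third line, we identify the variance parameter (6.1.1). Afterward, we extract the logarithm
and simplify. Finally, we minimize the expression (ideally with a computer algebra system) to
complete the proof of (6.1.2).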

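Next, we develop the tail bound (6.1.3) for $\lambda_{\max}(Y)$. Owing to the master tail inequality (3.6.3),
we have

$$\begin{aligned}
\mathbb{P}\{ \lambda_{\max}(Y) \ge t \} &\le \inf_{\theta > 0} e^{-\theta t} \operatorname{tr} \exp\Big( \sum_k \log \mathbb{E} e^{\theta X_k} \Big) \\
&\le \inf_{0 < \theta < 3/R} e^{-\theta t} \operatorname{tr} \exp\Big( g(\theta) \sum_k \mathbb{E}(X_k^2) \Big) \\
&\le \inf_{0 < \theta < 3/R} d \, e^{-\theta t} \exp\big( g(\theta) \cdot \sigma^2 \big).
\end{aligned}$$

The justifications are the same as before. The exact value of the infimum is messy, so we proceed
with the inspired choice $\theta = t/(\sigma^2 + Rt/3)$, which results in the elegant bound (6.1.3). □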
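For readers who want to see the last substitution carried out, the following derivation sketches how the choice $\theta = t/(\sigma^2 + Rt/3)$ produces the exponent in the tail bound, assuming $g(\theta) = (\theta^2/2)/(1 - R\theta/3)$ and $\sigma^2 > 0$:

```latex
% The choice is admissible: \theta < 3/R is equivalent to Rt < 3\sigma^2 + Rt.
1 - \frac{R\theta}{3}
  = \frac{\sigma^2 + Rt/3 - Rt/3}{\sigma^2 + Rt/3}
  = \frac{\sigma^2}{\sigma^2 + Rt/3}.

% Evaluate g(\theta) \sigma^2 with this value of 1 - R\theta/3.
g(\theta) \cdot \sigma^2
  = \frac{t^2/2}{(\sigma^2 + Rt/3)^2} \cdot \frac{\sigma^2 + Rt/3}{\sigma^2} \cdot \sigma^2
  = \frac{t^2/2}{\sigma^2 + Rt/3}.

% Combine with the linear term -\theta t.
-\theta t + g(\theta) \cdot \sigma^2
  = -\frac{t^2}{\sigma^2 + Rt/3} + \frac{t^2/2}{\sigma^2 + Rt/3}
  = \frac{-t^2/2}{\sigma^2 + Rt/3}.
```

Exponentiating and multiplying by $d$ gives the right-hand side of the tail bound.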

Finally, we explain how to derive Corollary 6.2.1, for general matrices, from Theorem 6.1.1.
This result follows immediately when we apply the matrix Bernstein bounds for Hermitian
matrices to the Hermitian dilation of a sum of general matrices.

Proof of Corollary 6.2.1. Consider a finite sequence $\{S_k\}$ of $d_1 \times d_2$ random matrices, and assume
that

$$\mathbb{E} S_k = 0 \quad\text{and}\quad \|S_k\| \le R.$$

We define the two random matrices

$$Z = \sum_k S_k \quad\text{and}\quad Y = \mathcal{H}(Z),$$

where $\mathcal{H}$ is the Hermitian dilation (2.1.11). We will invoke Theorem 6.1.1 to analyze $\|Z\|$. First,
recall the fact (2.1.13) that

$$\|Z\| = \lambda_{\max}(\mathcal{H}(Z)) = \lambda_{\max}(Y).$$

Next, we express the variance (6.1.1) of the random Hermitian matrix $Y$ in terms of the general
matrix $Z$. Indeed,

$$\begin{aligned}
\sigma^2(Y) = \|\mathbb{E}(Y^2)\| = \|\mathbb{E}(\mathcal{H}(Z)^2)\|
&= \left\| \mathbb{E} \begin{bmatrix} ZZ^* & 0 \\ 0 & Z^*Z \end{bmatrix} \right\| \\
&= \left\| \begin{bmatrix} \mathbb{E}(ZZ^*) & 0 \\ 0 & \mathbb{E}(Z^*Z) \end{bmatrix} \right\|
= \max\{ \|\mathbb{E}(ZZ^*)\|, \ \|\mathbb{E}(Z^*Z)\| \} = \sigma^2(Z).
\end{aligned}$$

The third relation is the identity (2.1.12) for the square of the Hermitian dilation. The penultimate
equation holds because the norm of a block-diagonal matrix is the maximum norm of any
diagonal block. We obtain the formula (6.2.1) for the variance of the matrix $Z$. Finally, we invoke
Theorem 6.1.1 to establish Corollary 6.2.1. □

6.6  Notes 

There  are  a  wide  variety  of  Bernstein-type  inequalities  available  in  the  scalar  case,  and  the  matrix 
case  is  no  different.  The  applications  of  the  matrix  Bernstein  inequality  are  also  numerous.  We 
only  give  a  brief  summary  here. 

6.6.1  Matrix  Bernstein  Inequalities 

David Gross [Gro09] and Ben Recht [Rec11] used the approach of Ahlswede-Winter [AW02] to
develop two different versions of the matrix Bernstein inequality. These papers played a big role
in popularizing the use of matrix concentration inequalities in mathematical signal processing and
statistics. Nevertheless, their results involve a suboptimal variance parameter of the form

$$\sigma^2_{\mathrm{AW}} = \sum_k \| \mathbb{E}(X_k^2) \|.$$

In general, this parameter is significantly larger than the variance (6.1.1) that appears in
Theorem 6.1.1. They do coincide in some special cases, such as when the summands are independent
and identically distributed.

Oliveira [Oli10a] established the first version of the matrix Bernstein inequality that yields
the correct variance parameter (6.1.1). He accomplished this task with an elegant application
of the Golden-Thompson inequality (3.3.3). His method even gives a result, called the matrix
Freedman inequality, that holds for matrix-valued martingales. His bound is roughly equivalent
to Theorem 6.1.1, up to the precise value of the constants.

The matrix Bernstein inequality we have stated here, Theorem 6.1.1, first appeared in the
paper [Tro11d, §6]. The bounds for the expectation are new. The argument is based on Lieb’s
Theorem, and it also delivers a matrix Bennett inequality and the split Bernstein inequality (6.1.4)
discussed here. This paper also describes how to establish matrix Bernstein inequalities for sums
of unbounded random matrices, given some control over the matrix moments.

The research in [Tro11d] is independent from Oliveira’s ideas [Oli10a]. Motivated by Oliveira’s
paper, the article [Tro11a] and the technical report [Tro11c] show how to use Lieb’s Theorem to
study matrix martingales. The subsequent paper [GT11] explains how to develop a Bernstein
inequality for interior eigenvalues using the Lieb-Seiringer Theorem [LS05].

For more versions of the matrix Bernstein inequality, see Vladimir Koltchinskii’s lecture notes
from Saint-Flour [Kol11].

6.6.2  Randomized  Matrix  Multiplication 

The idea of using random sampling to accelerate matrix multiplication appears in a paper by
Drineas, Kannan, and Mahoney [DKM06]. Subsequently, Tamas Sarlos obtained a significant
improvement in the performance of this algorithm [Sar06]. The analysis we have given here
is a corrected version of the argument in the work of Hsu, Kakade, and Zhang [HKZ12b]; see
also [HKZ12a]. A related analysis appears in the paper of Magen and Zouzias [MZ11].

6.6.3  Randomized  Sparsification 

The  idea  of  using  randomized  sparsification  to  accelerate  spectral  computations  appears  in  a 
paper  of  Achlioptas  and  McSherry  [AM07].  Drineas  and  Zouzias  [DZ11]  point  out  that  matrix 
concentration  inequalities  can  be  used  to  analyze  this  type  of  algorithm.  For  further  results  on 
sparsification,  see  the  paper  [GT]. 


CHAPTER 7


Results Involving the Intrinsic Dimension


A minor shortcoming of our matrix concentration results is the dependence on the ambient
dimension of the matrix. In this chapter, we show how to obtain a dependence on an intrinsic
dimension parameter, which is sometimes much smaller than the ambient dimension. In
many cases, intrinsic dimension bounds offer only a modest improvement. Nevertheless, there
are examples where the benefits are significant enough that we can obtain nontrivial results for
infinite-dimensional random matrices.

We present a version of the matrix Chernoff inequality for an independent sum of bounded,
positive-semidefinite random matrices that involves an intrinsic dimension parameter. This
result is interesting, but it is not entirely satisfactory because it lacks a bound for the minimum
eigenvalue. We also describe a version of the matrix Bernstein inequality for an independent
sum of bounded, zero-mean random matrices that involves an intrinsic dimension parameter.
The intrinsic Bernstein result often improves on Theorem 6.1.1. We omit intrinsic dimension
bounds for matrix series, which the reader may wish to develop as an exercise.

To give a sense of what these new results accomplish, we reconsider some of the examples
from earlier chapters. We apply the intrinsic Chernoff bound to study a random column submatrix
of a fixed matrix, and we use the intrinsic Bernstein bound to analyze the sample covariance
estimator. In each case, the intrinsic dimension parameters have an attractive interpretation in
terms of the problem data.

We begin our development in §7.1 with the definition of the intrinsic dimension of a matrix.
In §7.2, we present the intrinsic Chernoff bound and some of its consequences. In §7.3,
we describe the intrinsic Bernstein bounds and their applications. Afterward, we describe the
new ingredients that are required in the proofs. Section 7.4 explains how to extend the matrix
Laplace transform method beyond the exponential function, and §7.5 describes a simple but
powerful lemma that allows us to obtain the dependence on the intrinsic dimension. Section 7.6
contains the proof of the intrinsic Chernoff bound, and §7.7 develops the proof of the intrinsic
Bernstein bound.

7.1 The Intrinsic Dimension of a Matrix


Some  types  of  random  matrices  are  concentrated  in  a  small  number  of  dimensions,  while  they 
have  little  content  in  other  dimensions.  So  far,  our  bounds  do  not  account  for  the  difference.  We 
need  to  introduce  a  more  refined  notion  of  dimension  that  will  allow  us  to  discriminate  among 
these  examples. 


Definition  7.1.1  (Intrinsic  Dimension).  Fora  positive-semidefinite  matrix  A,  the  intrinsic  dimen¬ 
sion  is  the  quantity 


intdim(A) 


1 1  A 

m' 


By  expressing  the  trace  and  the  norm  in  terms  of  the  eigenvalues,  we  can  verify  that 


1  <  intdim(A)  <  rank(A)  <  d i m (A)- 


The  lower  inequality  is  attained  precisely  when  A  has  rank  one,  while  the  upper  inequality  is 
attained precisely when A is a multiple of the identity. Note that the intrinsic dimension is
0-homogeneous, so it is insensitive to changes in the scale of the matrix A. We interpret the
intrinsic dimension as a reflection of the number of dimensions where A has significant spectral
content. 
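
As a small illustration of Definition 7.1.1, the following sketch computes the intrinsic dimension directly from a list of eigenvalues, using the fact that the trace is the sum of the eigenvalues and the norm of a positive-semidefinite matrix is its largest eigenvalue. The helper names `intdim` and `rank` and the three example spectra are hypothetical choices made for this illustration only.

```python
def intdim(eigenvalues):
    """Intrinsic dimension tr(A) / ||A|| of a PSD matrix, given its eigenvalues."""
    eigs = list(eigenvalues)
    if not eigs or min(eigs) < 0:
        raise ValueError("expected the eigenvalues of a positive-semidefinite matrix")
    top = max(eigs)
    if top == 0:
        raise ValueError("intrinsic dimension is undefined for the zero matrix")
    return sum(eigs) / top


def rank(eigenvalues, tol=1e-12):
    """Number of eigenvalues that exceed a small tolerance."""
    return sum(1 for lam in eigenvalues if lam > tol)


rank_one = [3.0, 0.0, 0.0, 0.0]           # intdim equals 1
identity_multiple = [2.0, 2.0, 2.0, 2.0]  # intdim equals the ambient dimension 4
decaying = [1.0, 0.5, 0.25, 0.0]          # intdim lies strictly between 1 and the rank

for name, eigs in [("rank one", rank_one),
                   ("multiple of identity", identity_multiple),
                   ("decaying spectrum", decaying)]:
    print(f"{name:>22}: intdim = {intdim(eigs):.3f}, rank = {rank(eigs)}, dim = {len(eigs)}")

# Rescaling the matrix leaves the intrinsic dimension unchanged (0-homogeneity).
assert intdim([10 * lam for lam in decaying]) == intdim(decaying)
```

The three spectra exercise the two extreme cases and an intermediate case of the chain of inequalities above.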


7.2  Matrix  Chernoff  with  Intrinsic  Dimension 

Let us begin with an extension of the matrix Chernoff inequality. We obtain bounds for the
maximum eigenvalue of a sum of bounded, positive-semidefinite matrices that depend on the
intrinsic dimension of the expectation of the sum.

Theorem 7.2.1 (Matrix Chernoff: Intrinsic Dimension). Consider a finite sequence $\{X_k\}$ of
independent, random, Hermitian matrices that satisfy

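$$X_k \succcurlyeq 0 \quad\text{and}\quad \lambda_{\max}(X_k) \le R.$$

Define the random matrix

$$Y = \sum_k X_k.$$

Introduce an intrinsic dimension parameter and a mean parameter:

$$d = d(Y) = \operatorname{intdim}(\mathbb{E}\,Y) \quad\text{and}\quad \mu_{\max} = \mu_{\max}(Y) = \lambda_{\max}(\mathbb{E}\,Y).$$

Then, for $\theta > 0$,

$$\mathbb{E}\,\lambda_{\max}(Y) \le \frac{e^{\theta} - 1}{\theta} \cdot \mu_{\max} + \frac{1}{\theta} \cdot R \log(2d). \tag{7.2.1}$$

Furthermore,

$$\mathbb{P}\{\lambda_{\max}(Y) \ge (1+\delta)\mu_{\max}\} \le 2d \cdot \left[\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right]^{\mu_{\max}/R} \quad\text{for } \delta \ge 1. \tag{7.2.2}$$
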

The proof of this result appears below in §7.6.

7.2.1 Discussion

Theorem 7.2.1 is almost identical with the parts of the basic matrix Chernoff inequality that
concern the maximum eigenvalue $\lambda_{\max}(Y)$. Let us call attention to the differences. The key
advantage is that the current result depends on the intrinsic dimension of the mean $\mathbb{E}\,Y$ instead
of the ambient dimension. When the eigenvalues of $\mathbb{E}\,Y$ decay, the improvement can be dramatic.
We do suffer a small cost in the extra factor of two, and the tail bound is restricted to a smaller
range of the parameter $\delta$. Neither of these limitations is particularly significant.

A  more  serious  flaw  in  Theorem  7.2.1  is  that  it  does  not  provide  any  information  about  the 
minimum eigenvalue $\lambda_{\min}(Y)$. Curiously, the approach we use to prove the result just does not
work  for  the  minimum  eigenvalue. 
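
To get a feeling for the size of the dimensional factor, the following sketch evaluates the right-hand sides of (7.2.1) and (7.2.2). The function names and the numerical values are hypothetical. Plugging an ambient dimension into the same formulas is done only to compare the dimensional dependence; it is not meant to reproduce the exact constants of the basic matrix Chernoff inequality.

```python
import math


def chernoff_expectation_bound(mu_max, R, d, theta=1.0):
    """Right-hand side of (7.2.1): bound on E lambda_max(Y) for theta > 0."""
    if theta <= 0:
        raise ValueError("theta must be positive")
    return (math.exp(theta) - 1.0) / theta * mu_max + R * math.log(2.0 * d) / theta


def chernoff_tail_bound(mu_max, R, d, delta):
    """Right-hand side of (7.2.2): bound on P{lambda_max(Y) >= (1 + delta) mu_max}."""
    if delta < 1:
        raise ValueError("the tail bound (7.2.2) is stated for delta >= 1")
    log_base = delta - (1.0 + delta) * math.log1p(delta)  # log of e^delta / (1+delta)^(1+delta)
    return 2.0 * d * math.exp(log_base * mu_max / R)


mu_max, R = 10.0, 1.0  # illustrative values, not taken from the text
for label, dim in [("intrinsic d = 3", 3.0), ("ambient dimension 1000", 1000.0)]:
    print(label,
          "| expectation bound:", round(chernoff_expectation_bound(mu_max, R, dim), 3),
          "| tail bound at delta = 1:", round(chernoff_tail_bound(mu_max, R, dim, 1.0), 4))
```

With these values, the tail bound is informative for the small intrinsic dimension but exceeds one when the dimension parameter is 1000.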

7.2.2  Example:  A  Random  Column  Submatrix 

To  demonstrate  the  value  of  Theorem  7.2.1,  we  apply  it  to  bound  the  expected  norm  of  a  random 
column  submatrix  drawn  from  a  fixed  matrix,  a  problem  we  considered  in  §5.2. 

In this example, we began with a fixed $d \times n$ matrix $B$, and we formed a random submatrix
$Z$ containing an average of $q$ nonzero columns from $B$. In the analysis, we applied the matrix
Chernoff inequality to the random matrix $Y = ZZ^*$, which takes the form

$$Y = \sum_{k=1}^{n} \eta_k \, b_k b_k^*.$$

Here, $\{\eta_k\}$ is an independent family of Bernoulli random variables with common mean $q/n$. We
have written $b_k$ for the $k$th column of $B$.

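To invoke Theorem 7.2.1, we just need to compute the intrinsic dimension $d(Y) = \operatorname{intdim}(\mathbb{E}\,Y)$.
Recall that $\mathbb{E}\,Y = (q/n)\,BB^*$, so that

$$d(Y) = \operatorname{intdim}\left(\frac{q}{n}\,BB^*\right) = \operatorname{intdim}(BB^*) = \frac{\operatorname{tr}(BB^*)}{\|BB^*\|} = \frac{\|B\|_F^2}{\|B\|^2} = \operatorname{srank}(B).$$

The second identity holds because the intrinsic dimension is scale invariant. The last relation is
simply Definition 6.4.1. Therefore, the expectation bound (7.2.1) with $\theta = 1$ delivers

$$\mathbb{E}\left(\|Z\|^2\right) = \mathbb{E}\,\lambda_{\max}(Y) \le (e - 1) \cdot \mu_{\max}(Y) + R \log(2 \cdot \operatorname{srank}(B)).$$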


In  contrast,  our  previous  analysis  led  to  a  logarithmic  factor  of  log  d.  If  the  matrix  B  has  deficient 
stable  rank — meaning  that  it  has  many  rows  which  are  almost  collinear — then  the  new  bound 
can  result  in  a  serious  improvement. 
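
The following sketch compares the two logarithmic factors on a concrete matrix. It is written in plain Python so that it runs without third-party packages; the spectral norm is estimated by power iteration, which is a standard technique but not one used in the text. The matrix `B`, whose rows are small perturbations of a single vector, is a hypothetical example.

```python
import math


def frobenius_norm_sq(B):
    return sum(x * x for row in B for x in row)


def spectral_norm_sq(B, iterations=200):
    """Estimate ||B||^2 by power iteration on B^T B (real entries assumed)."""
    v = list(B[0])  # start from the first row; assumes it is nonzero
    for _ in range(iterations):
        Bv = [sum(b * x for b, x in zip(row, v)) for row in B]       # B v
        w = [sum(B[i][j] * Bv[i] for i in range(len(B)))              # B^T (B v)
             for j in range(len(v))]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Bv = [sum(b * x for b, x in zip(row, v)) for row in B]
    return sum(x * x for x in Bv)  # Rayleigh quotient of B^T B at the unit vector v


def stable_rank(B):
    """srank(B) = ||B||_F^2 / ||B||^2."""
    return frobenius_norm_sq(B) / spectral_norm_sq(B)


# Hypothetical 50 x 20 matrix whose rows are tiny perturbations of one vector.
d, n = 50, 20
B = [[math.cos(j) + 0.01 * math.sin(i + 2 * j) for j in range(n)] for i in range(d)]

srank = stable_rank(B)
print(f"ambient dimension d = {d}, stable rank = {srank:.4f}")
print(f"log(d) = {math.log(d):.3f}  versus  log(2 * srank) = {math.log(2 * srank):.3f}")
```

Because the rows are almost collinear, the stable rank is close to one, and the factor log(2 srank(B)) is much smaller than log d.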

7.3  Matrix  Bernstein  with  Intrinsic  Dimension 


We  continue  with  extensions  of  the  matrix  Bernstein  inequality.  These  results  provide  tail  bounds 
for  an  independent  sum  of  bounded  random  matrices  that  depend  on  the  intrinsic  dimension 
of the variance.

7.3.1 The Hermitian Case

We begin with the results for an independent sum of Hermitian random matrices whose
eigenvalues are bounded above.

Theorem 7.3.1 (Matrix Bernstein: Hermitian Case with Intrinsic Dimension). Consider a finite
sequence $\{X_k\}$ of independent, random Hermitian matrices that satisfy

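$$\mathbb{E}\,X_k = 0 \quad\text{and}\quad \lambda_{\max}(X_k) \le R.$$

Define the random matrix

$$Y = \sum_k X_k.$$

Introduce the intrinsic dimension and variance parameters

$$d = d(Y) = \operatorname{intdim}\left(\mathbb{E}(Y^2)\right) \quad\text{and}\quad \sigma^2 = \sigma^2(Y) = \left\|\mathbb{E}(Y^2)\right\|.$$

Then, for $t \ge \sigma + R/3$,

$$\mathbb{P}\{\lambda_{\max}(Y) \ge t\} \le 4d \cdot \exp\left(\frac{-t^2/2}{\sigma^2 + Rt/3}\right). \tag{7.3.1}$$

The proof of this result appears below in §7.7.
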


Discussion 

Theorem  7.3.1  is  quite  similar  to  Theorem  6.1.1,  so  we  focus  on  the  differences.  Note  that  the 
tail bound (7.3.1) now depends on the intrinsic dimension of the variance matrix $\mathbb{E}(Y^2)$, which
is  never  larger  than  the  ambient  dimension.  As  a  consequence,  the  tail  bound  is  almost  always 
sharper  than  the  earlier  result.  The  costs  of  this  improvement  are  small:  We  pay  an  extra  factor 
of  four,  and  we  must  restrict  our  attention  to  a  more  limited  range  of  the  parameter  t.  Neither  of 
these  changes  is  significant. 

We can obtain a bound for $\mathbb{E}\,\lambda_{\max}(Y)$ by integrating the tail inequality (7.3.1), which gives

$$\mathbb{E}\,\lambda_{\max}(Y) \le \mathrm{Const} \cdot \left[\sigma \sqrt{\log d} + R \log d\right].$$

It  seems  likely  that  we  could  adapt  the  argument  to  obtain  a  more  direct  proof  of  the  expectation 
bound,  along  with  an  explicit  constant. 

The  other  commentary  about  the  original  matrix  Bernstein  inequality,  Theorem  6.1.1,  also 
applies  to  the  intrinsic  dimension  result.  Using  similar  arguments,  we  can  obtain  bounds  for 
$\lambda_{\min}(Y)$, and we can adapt the result to an independent sum of uncentered, bounded, random
Hermitian  matrices.  The  modifications  required  in  these  cases  are  straightforward. 

Finally, let us mention a subtle but important point concerning the application of
Theorem 7.3.1. It is often difficult or unwieldy to compute the exact values of the parameters $d(Y)$
and $\sigma^2(Y)$. In this case, we can proceed as follows. Suppose that $\mathbb{E}(Y^2) \preccurlyeq V$ for some
positive-semidefinite matrix $V$. A slight modification to the proof of Theorem 6.1.1 yields the tail bound

$$\mathbb{P}\{\lambda_{\max}(Y) \ge t\} \le 4 \cdot \operatorname{intdim}(V) \cdot \exp\left(\frac{-t^2/2}{\|V\| + Rt/3}\right) \tag{7.3.2}$$

for all $t \ge \|V\|^{1/2} + R/3$. This version of the result is often much easier to apply.
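
The recipe behind (7.3.2) is easy to mechanize. The sketch below evaluates the right-hand side of (7.3.2) from the eigenvalues of a semidefinite bound $V$, the uniform bound $R$, and the level $t$. The function name, the geometric spectrum, and the numerical values are hypothetical choices for illustration.

```python
import math


def intrinsic_bernstein_tail(eigs_V, R, t):
    """Right-hand side of (7.3.2), given the eigenvalues of the PSD bound V."""
    norm_V = max(eigs_V)
    if t < math.sqrt(norm_V) + R / 3.0:
        raise ValueError("(7.3.2) is stated only for t >= ||V||^(1/2) + R/3")
    intdim_V = sum(eigs_V) / norm_V
    return 4.0 * intdim_V * math.exp(-(t * t / 2.0) / (norm_V + R * t / 3.0))


# Hypothetical variance bound V in ambient dimension 1000 with a geometric spectrum.
eigs_V = [0.5 ** j for j in range(1000)]
R = 0.1
for t in (1.5, 3.0, 5.0):
    print(f"t = {t}: bound = {intrinsic_bernstein_tail(eigs_V, R, t):.3e}")
```

For this spectrum the intrinsic dimension of $V$ is about two, so the prefactor is about 8 rather than a multiple of the ambient dimension 1000.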

7.3.2 The General Case

Next,  we  present  the  adaptation  for  an  independent  sum  of  general  random  matrices  that  are 
bounded  in  spectral  norm. 

Corollary 7.3.2 (Matrix Bernstein: Rectangular Case with Intrinsic Dimension). Consider a finite
sequence $\{S_k\}$ of independent, random complex matrices that satisfy


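$$\mathbb{E}\,S_k = 0 \quad\text{and}\quad \|S_k\| \le R.$$

Define the random matrix

$$Z = \sum_k S_k.$$

Introduce the intrinsic dimension parameter

$$d = d(Z) = \operatorname{intdim}\begin{bmatrix} \mathbb{E}(ZZ^*) & 0 \\ 0 & \mathbb{E}(Z^*Z) \end{bmatrix} \tag{7.3.3}$$

and the variance parameter

$$\sigma^2 = \sigma^2(Z) = \max\left\{\|\mathbb{E}(ZZ^*)\|,\ \|\mathbb{E}(Z^*Z)\|\right\}.$$

Then, for $t \ge \sigma + R/3$,

$$\mathbb{P}\{\|Z\| \ge t\} \le 4d \cdot \exp\left(\frac{-t^2/2}{\sigma^2 + Rt/3}\right). \tag{7.3.4}$$

The proof of this result appears below in §7.7.
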


Discussion 

Corollary 7.3.2 is very similar to Theorem 7.3.1 and our earlier result, Corollary 6.2.1. As a
consequence, we limit our discussion to a single point. Note that the intrinsic dimension
parameter (7.3.3) is computed from a block-diagonal matrix that contains both of the squares of the
matrix $Z$. It follows that

$$d(Z) = \frac{\mathbb{E}\operatorname{tr}(ZZ^*) + \mathbb{E}\operatorname{tr}(Z^*Z)}{\max\left\{\|\mathbb{E}(ZZ^*)\|,\ \|\mathbb{E}(Z^*Z)\|\right\}}.$$

In other words, we divide by the norm of the larger block. We can make a further bound to obtain
a result in terms of the intrinsic dimensions of the two blocks:

$$d(Z) \le \operatorname{intdim}(\mathbb{E}(ZZ^*)) + \operatorname{intdim}(\mathbb{E}(Z^*Z)).$$

An  interesting  consequence  is  that  the  intrinsic  dimension  d{Z)  can  be  much  smaller  than  the 
intrinsic  dimension  of  either  E(ZZ*)  or  E(Z*Z). 
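
A short computation makes the formula for $d(Z)$ concrete. The sketch below works from the eigenvalues of the two blocks. The spectra are a hypothetical example: they would arise, for instance, from a $1 \times n$ row vector that carries a single random sign in a uniformly random position, so that $\mathbb{E}(ZZ^*) = 1$ and $\mathbb{E}(Z^*Z) = n^{-1} I$. The function names are assumptions made for this illustration.

```python
def intdim(eigs):
    """Intrinsic dimension of a PSD matrix from its eigenvalues."""
    return sum(eigs) / max(eigs)


def intdim_rectangular(eigs_left, eigs_right):
    """d(Z) from the eigenvalues of E(ZZ*) and E(Z*Z), following (7.3.3)."""
    return (sum(eigs_left) + sum(eigs_right)) / max(max(eigs_left), max(eigs_right))


n = 64
eigs_left = [1.0]            # spectrum of E(ZZ*): a single unit eigenvalue
eigs_right = [1.0 / n] * n   # spectrum of E(Z*Z): the identity scaled by 1/n

d_Z = intdim_rectangular(eigs_left, eigs_right)
print("d(Z)           =", d_Z)
print("intdim(E(ZZ*)) =", intdim(eigs_left))
print("intdim(E(Z*Z)) =", intdim(eigs_right))
```

In this example the two blocks have the same trace, the larger norm belongs to the first block, and $d(Z)$ is far below the intrinsic dimension of the block with the smaller norm while still respecting the bound by the sum of the two intrinsic dimensions.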


7.3.3  Example:  Sample  Covariance  Matrices,  Redux 

To  demonstrate  the  value  of  the  intrinsic  dimension  results,  let  us  apply  Theorem  7.3.1  to  the 
sample  covariance  matrix  example  we  analyzed  in  §1.6.3. 

Consider a random vector $x$ with zero mean, covariance $A$, and uniform upper bound $\|x\|^2 \le B$.
The sample covariance matrix is $Y = n^{-1} \sum_{k=1}^{n} x_k x_k^*$, where $x_1, \dots, x_n$ are independent samples
from the distribution of $x$. Recall that the random matrix of interest is $E = Y - A$, the discrepancy
between the sample covariance matrix and the true covariance.

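We expressed the error matrix $E$ as the sum of the independent random matrices

$$S_k = \frac{1}{n}\left(x_k x_k^* - A\right).$$

The summands have the properties that $\mathbb{E}\,S_k = 0$ and $\|S_k\| \le 2B/n$. Moreover, $\mathbb{E}(S_k^2) \preccurlyeq (B/n^2) \cdot A$,
so that

$$\mathbb{E}(E^2) \preccurlyeq \frac{B}{n} \cdot A.$$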

As discussed, we may substitute the semidefinite upper bound $V = (B/n) \cdot A$ for $\mathbb{E}(E^2)$ when we
compute the variance parameter and the intrinsic dimension parameter in Theorem 7.3.1.

Let us introduce the intrinsic dimension and variance parameters

$$\operatorname{intdim}(V) = \frac{\operatorname{tr} A}{\|A\|} \quad\text{and}\quad \|V\| = \frac{B}{n}\,\|A\|.$$


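We can apply the modified tail bound (7.3.2) to both $E$ and $-E$ to control $\lambda_{\max}(E)$ and $\lambda_{\min}(E)$.
Combine these two results with the union bound to reach the spectral norm estimate

$$\mathbb{P}\{\|Y - A\| \ge t\} \le \frac{8 \operatorname{tr} A}{\|A\|} \cdot \exp\left(\frac{-t^2/2}{B\|A\|/n + 2Bt/(3n)}\right),$$

valid when $t$ is sufficiently large. To achieve a relative error $\varepsilon \in (0, 1]$, the number $n$ of samples
should satisfy

$$n \ge \mathrm{Const} \cdot \frac{B \log(\operatorname{intdim}(A))}{\varepsilon^2 \|A\|}. \tag{7.3.5}$$

In this case, we obtain a tail bound of the form

$$\mathbb{P}\{\|Y - A\| \ge \varepsilon \|A\|\} \le \mathrm{Const} \cdot \operatorname{intdim}(A)^{-\mathrm{Const}}.$$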


By  increasing  the  number  n  of  samples,  we  can  increase  the  exponent  in  the  tail  probability. 

The  key  observation  is  that  the  intrinsic  dimension  term  in  (7.3.5)  may  be  much  smaller  than 
the  ambient  dimension  of  the  covariance  matrix  A.  For  instance,  if  the  ordered  eigenvalues  of  A 
satisfy  the  bounds 

A ; {A)  <  — -r  for  each  /  =  1,2,3,..., 

r 

then  the  logarithmic  factor  in  (7.3.5)  reduces  to  a  constant  that  is  independent  of  the  dimension 
of the covariance matrix A.

Finally, we note that this result has an attractive interpretation: The intrinsic dimension
parameter intdim(A) is the total variance of all the components of the random vector x divided by
the  maximum  variance  achieved  by  any  component  of  x. 
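
To see how a decaying spectrum keeps the intrinsic dimension bounded, consider the following sketch. It assumes, purely for illustration, that the eigenvalues of the covariance matrix are exactly $1/j^2$; this decay profile is a hypothetical choice and is not necessarily the condition intended in the display above.

```python
import math


def intdim(eigs):
    """Total variance divided by the maximum variance."""
    return sum(eigs) / max(eigs)


for d in (10, 1_000, 100_000):
    eigs = [1.0 / (j * j) for j in range(1, d + 1)]  # assumed decay: lambda_j = 1 / j^2
    print(f"d = {d:>6}: intdim = {intdim(eigs):.4f}, log(d) = {math.log(d):.2f}")

print(f"limit as d grows: pi^2 / 6 = {math.pi ** 2 / 6:.4f}")
```

Under this assumption the intrinsic dimension stays below $\pi^2/6$ no matter how large the ambient dimension becomes, whereas log d keeps growing.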


7.4  Revisiting  the  Matrix  Laplace  Transform  Bound 

After some reflection, we can trace the dependence on the ambient dimension in our earlier
results to the proof of Proposition 3.2.1. In the original argument, we used an exponential function
to transform the tail event before applying Markov’s inequality. This approach leads to trouble
for the simple reason that the exponential function does not pass through the origin, which gives
undue weight to eigenvalues that are close to zero.

We  can  resolve  this  problem  by  using  other  types  of  functions  to  transform  the  tail  event.  The 
functions  we  have  in  mind  are  adjusted  versions  of  the  exponential  function.  In  particular,  for 
fixed $\theta > 0$, we can consider

$$\psi_1(t) = \max\{0,\ e^{\theta t} - 1\} \quad\text{and}\quad \psi_2(t) = e^{\theta t} - \theta t - 1.$$

Both functions are nonnegative and convex, and they are nondecreasing on the positive real line.
In each case, $\psi_i(0) = 0$. At the same time, the presence of the exponential function allows us to
exploit  our  bounds  for  the  trace  mgf. 

Proposition 7.4.1 (Generalized Matrix Laplace Transform Bound). Let $Y$ be a random Hermitian
matrix. Let $\psi : \mathbb{R} \to \mathbb{R}_+$ be a nonnegative function that is nondecreasing on $[0, \infty)$. For each $t > 0$,

$$\mathbb{P}\{\lambda_{\max}(Y) \ge t\} \le \frac{1}{\psi(t)}\, \mathbb{E} \operatorname{tr} \psi(Y).$$

Proof. The proof follows the same lines as the proof of Proposition 3.2.1, but it requires some
additional finesse. Since $\psi$ is nondecreasing on $[0, \infty)$, the bound $a \ge t$ implies that $\psi(a) \ge \psi(t)$.
It follows that

$$\lambda_{\max}(Y) \ge t \quad\Longrightarrow\quad \lambda_{\max}(\psi(Y)) \ge \psi(t).$$

Indeed, on the tail event $\lambda_{\max}(Y) \ge t$, we must have $\psi(\lambda_{\max}(Y)) \ge \psi(t)$. The Spectral Mapping
Theorem, Proposition 2.1.3, indicates that $\psi(\lambda_{\max}(Y))$ is among the eigenvalues of $\psi(Y)$, and we
determine that $\lambda_{\max}(\psi(Y))$ also exceeds $\psi(t)$.

Returning to the tail probability, we discover that

$$\mathbb{P}\{\lambda_{\max}(Y) \ge t\} \le \mathbb{P}\{\lambda_{\max}(\psi(Y)) \ge \psi(t)\} \le \frac{1}{\psi(t)}\, \mathbb{E}\,\lambda_{\max}(\psi(Y)).$$

The second bound is Markov’s inequality (2.2.1), which is valid because $\psi$ is nonnegative. Finally,

$$\mathbb{P}\{\lambda_{\max}(Y) \ge t\} \le \frac{1}{\psi(t)}\, \mathbb{E} \operatorname{tr} \psi(Y).$$

The inequality holds because of the fact (2.1.5) that the trace of $\psi(Y)$, a positive-semidefinite
matrix, must be as large as its maximum eigenvalue. □
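
Since both sides of Proposition 7.4.1 depend only on the eigenvalues of $Y$, we can check the bound by enumeration on a toy distribution. The sketch below uses a hypothetical random matrix with three possible spectra and compares the exact tail probability with the bound for each of the two adjusted exponential functions. The distribution, the value $\theta = 1$, and the level $t = 1.5$ are illustrative assumptions.

```python
import math

theta = 1.0


def psi1(a):
    return max(0.0, math.exp(theta * a) - 1.0)


def psi2(a):
    return math.exp(theta * a) - theta * a - 1.0


# Hypothetical distribution: (probability, eigenvalues of Y) for three outcomes.
outcomes = [
    (0.5, [0.0, 0.1, -0.2]),
    (0.3, [1.0, 0.2, 0.0]),
    (0.2, [2.0, -1.0, 0.5]),
]


def exact_tail(t):
    """P{lambda_max(Y) >= t}, computed by enumeration."""
    return sum(p for p, eigs in outcomes if max(eigs) >= t)


def laplace_bound(psi, t):
    """E tr psi(Y) / psi(t); tr psi(Y) is the sum of psi over the eigenvalues of Y."""
    expected_trace = sum(p * sum(psi(lam) for lam in eigs) for p, eigs in outcomes)
    return expected_trace / psi(t)


t = 1.5
print("exact tail probability:", exact_tail(t))
print("bound with psi1       :", round(laplace_bound(psi1, t), 4))
print("bound with psi2       :", round(laplace_bound(psi2, t), 4))
```

Both bounds dominate the exact probability, as the proposition requires. Eigenvalues near zero contribute almost nothing to either trace, which is the point of using functions that vanish at the origin.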

7.5  The  Intrinsic  Dimension  Lemma 

The other new ingredient is a simple observation that allows us to control a trace function
applied to a positive-semidefinite matrix in terms of the intrinsic dimension of the matrix.

Lemma 7.5.1 (Intrinsic Dimension). Let $\varphi$ be a convex function on the interval $[0, \infty)$ with
$\varphi(0) = 0$. For any positive-semidefinite matrix $A$, it holds that

$$\operatorname{tr} \varphi(A) \le \operatorname{intdim}(A) \cdot \varphi(\|A\|).$$

Proof. Since $a \mapsto \varphi(a)$ is convex on the interval $[0, R]$, it is bounded above by the chord
connecting the endpoints. That is, for $a \in [0, R]$,

$$\varphi(a) \le \left(1 - \frac{a}{R}\right) \cdot \varphi(0) + \frac{a}{R} \cdot \varphi(R) = \frac{a}{R} \cdot \varphi(R).$$

The eigenvalues of $A$ fall in the interval $[0, R]$, where $R = \|A\|$. As an immediate consequence of
the Transfer Rule (2.1.6), we find that

$$\operatorname{tr} \varphi(A) \le \frac{\operatorname{tr} A}{\|A\|} \cdot \varphi(\|A\|).$$

Identify the intrinsic dimension of $A$ to complete the argument. □
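
The lemma can be checked numerically on any spectrum. The sketch below uses the convex function $\varphi(a) = e^a - 1$, which appears later in the proofs, together with a hypothetical list of eigenvalues, and it compares the intrinsic dimension bound with the cruder bound that uses the ambient dimension.

```python
import math


def trace_of_function(phi, eigs):
    """tr phi(A) for a Hermitian matrix A with the given eigenvalues."""
    return sum(phi(lam) for lam in eigs)


def intdim(eigs):
    return sum(eigs) / max(eigs)


def phi(a):
    return math.exp(a) - 1.0   # convex with phi(0) = 0


eigs = [4.0, 1.0, 0.5, 0.1, 0.0, 0.0, 0.0, 0.0]   # hypothetical PSD spectrum

lhs = trace_of_function(phi, eigs)
intrinsic = intdim(eigs) * phi(max(eigs))
ambient = len(eigs) * phi(max(eigs))
print(f"tr phi(A) = {lhs:.3f}")
print(f"intdim(A) * phi(||A||) = {intrinsic:.3f}")
print(f"dim(A) * phi(||A||)    = {ambient:.3f}")
```

The zero eigenvalues contribute nothing to the trace because $\varphi(0) = 0$, which is why the ambient dimension can be replaced by the intrinsic dimension.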


7.6  Proof  of  the  Intrinsic  Chernoff  Bound 


With  these  results  at  hand,  we  are  prepared  to  prove  our  first  intrinsic  dimension  result,  which 
extends  the  matrix  Chernoff  inequality. 


Proof of Theorem 7.2.1. Consider a finite sequence $\{X_k\}$ of independent, random Hermitian
matrices with

$$X_k \succcurlyeq 0 \quad\text{and}\quad \lambda_{\max}(X_k) \le R.$$

Introduce the sum

$$Y = \sum_k X_k.$$

The challenge is to establish bounds for $\lambda_{\max}(Y)$ that depend on the intrinsic dimension of the
matrix $\mathbb{E}\,Y$. We begin the argument with the proof of the tail bound (7.2.2). Afterward, we show
how to extract the expectation bound (7.2.1).

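Fix a number $\theta > 0$, and define the function $\psi(t) = \max\{0,\ e^{\theta t} - 1\}$ for $t \in \mathbb{R}$. The general
version of the matrix Laplace transform bound, Proposition 7.4.1, states that

$$\mathbb{P}\{\lambda_{\max}(Y) \ge t\} \le \frac{1}{\psi(t)}\, \mathbb{E} \operatorname{tr} \psi(Y) = \frac{1}{e^{\theta t} - 1}\, \mathbb{E} \operatorname{tr}\left(e^{\theta Y} - I\right). \tag{7.6.1}$$
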


We  have  exploited  the  fact  that  Y  is  positive  semidefinite  and  that  t  >  0.  The  presence  of  the 
identity  matrix  on  the  right-hand  side  allows  us  to  draw  stronger  conclusions  than  we  could 
before. 

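Let us study the expected trace term on the right-hand side of (7.6.1). As in the proof of our
original matrix Chernoff bound, Theorem 5.1.1, we have the bound

$$\mathbb{E} \operatorname{tr} e^{\theta Y} \le \operatorname{tr} \exp\left(g(\theta) \cdot (\mathbb{E}\,Y)\right) \quad\text{where}\quad g(\theta) = \frac{e^{\theta R} - 1}{R}.$$

Invoke the latter inequality, and introduce the function $\varphi(a) = e^a - 1$ to see that

$$\mathbb{E} \operatorname{tr}\left(e^{\theta Y} - I\right) \le \operatorname{tr} \varphi\left(g(\theta) \cdot (\mathbb{E}\,Y)\right) \le \operatorname{intdim}(\mathbb{E}\,Y) \cdot \varphi\left(g(\theta) \cdot \|\mathbb{E}\,Y\|\right).$$

The second inequality results from Lemma 7.5.1, the intrinsic dimension bound, and the fact
that the intrinsic dimension does not depend on the scaling factor $g(\theta)$. Recalling the notation
$d = \operatorname{intdim}(\mathbb{E}\,Y)$ and $\mu_{\max} = \|\mathbb{E}\,Y\|$, we continue the calculation:

$$\mathbb{E} \operatorname{tr}\left(e^{\theta Y} - I\right) \le d \cdot \varphi\left(g(\theta) \cdot \mu_{\max}\right) \le d \cdot \exp\left(g(\theta) \cdot \mu_{\max}\right). \tag{7.6.2}$$

We have used the trivial bound $\varphi(a) \le e^a$, which holds for $a \in \mathbb{R}$.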

To complete the argument, introduce the bound (7.6.2) on the expected trace into the
probability bound (7.6.1) to obtain

$$\mathbb{P}\{\lambda_{\max}(Y) \ge t\} \le d \cdot \frac{e^{\theta t}}{e^{\theta t} - 1} \cdot e^{-\theta t + g(\theta) \cdot \mu_{\max}}.$$

It is convenient to make the change of variables $t \mapsto (1+\delta)\mu_{\max}$. The previous estimate is valid
for all $\theta > 0$, so we can select $\theta = R^{-1} \log(1+\delta)$ to minimize the final exponential. To bound the
fraction, observe that

$$\frac{e^a}{e^a - 1} = 1 + \frac{1}{e^a - 1} \le 1 + \frac{1}{a} \quad\text{for } a > 0.$$

We obtain the latter inequality by replacing the convex function $a \mapsto e^a - 1$ with its tangent at
$a = 0$.

Altogether, these steps lead to the estimate

$$\mathbb{P}\{\lambda_{\max}(Y) \ge (1+\delta)\mu_{\max}\} \le d \cdot \left(1 + \frac{R/\mu_{\max}}{(1+\delta)\log(1+\delta)}\right) \cdot \left[\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right]^{\mu_{\max}/R}. \tag{7.6.3}$$

For random matrices, this inequality is rarely useful when $\delta < 1$, so it does little harm to place
the restriction that $\delta \ge 1$. Subject to this condition, the bracket (including the exponent) exceeds
one unless we also have

$$(1+\delta)\log(1+\delta) \ge \frac{R}{\mu_{\max}}.$$

Therefore, we can use the latter bound to make a numerical estimate for the parenthesis in (7.6.3),
which leads to the conclusion (7.2.2).

Now, we turn to the expectation bound (7.2.1). Observe that the functional inverse of $\psi$ is the
increasing concave function

$$\psi^{-1}(u) = \frac{1}{\theta} \log(1 + u) \quad\text{for } u \ge 0.$$

Since $Y$ is a positive-semidefinite matrix, we can calculate that

$$\begin{aligned}
\mathbb{E}\,\lambda_{\max}(Y) = \mathbb{E}\,\psi^{-1}(\psi(\lambda_{\max}(Y))) &\le \psi^{-1}(\mathbb{E}\,\psi(\lambda_{\max}(Y))) \\
&= \psi^{-1}(\mathbb{E}\,\lambda_{\max}(\psi(Y))) \le \psi^{-1}(\mathbb{E} \operatorname{tr} \psi(Y)).
\end{aligned} \tag{7.6.4}$$

The second relation is Jensen’s inequality (2.2.2), which is valid because $\psi^{-1}$ is concave. The third
relation follows from the Spectral Mapping Theorem, Proposition 2.1.3, because the function $\psi$
is increasing. We can bound the maximum eigenvalue by the trace because $\psi(Y)$ is positive
semidefinite and $\psi^{-1}$ is an increasing function.

Now, substitute the bound (7.6.2) into the last display (7.6.4) to reach

$$\begin{aligned}
\mathbb{E}\,\lambda_{\max}(Y) &\le \psi^{-1}\left(d \cdot \exp(g(\theta) \cdot \mu_{\max})\right) = \frac{1}{\theta} \log\left(1 + d \cdot e^{g(\theta) \cdot \mu_{\max}}\right) \\
&\le \frac{1}{\theta} \log\left(2d \cdot e^{g(\theta) \cdot \mu_{\max}}\right) = \frac{1}{\theta} \left(\log(2d) + g(\theta) \cdot \mu_{\max}\right).
\end{aligned}$$

The first inequality again requires the property that $\psi^{-1}$ is increasing. The second inequality
follows because $1 \le d \cdot e^{g(\theta) \cdot \mu_{\max}}$, which owes to the fact that the exponent is nonnegative. To
complete the argument, introduce the definition of $g(\theta)$, and make the change of variables
$\theta \mapsto \theta/R$. These steps yield (7.2.1). □

7.7 Proof of the Intrinsic Bernstein Bounds

In  this  section,  we  present  the  arguments  that  lead  up  to  the  intrinsic  Bernstein  bounds.  That  is, 
we  develop  tail  inequalities  for  an  independent  sum  of  bounded  random  matrices  that  depend 
on  the  intrinsic  dimension  of  the  variance. 


7.7.1  The  Hermitian  Case 


We  commence  with  the  results  for  an  independent  sum  of  random  Hermitian  matrices  whose 
eigenvalues  are  subject  to  an  upper  bound. 


Proof of Theorem 7.3.1. Consider a finite sequence $\{X_k\}$ of independent, random, Hermitian
matrices with

$$\mathbb{E}\,X_k = 0 \quad\text{and}\quad \lambda_{\max}(X_k) \le R.$$

Introduce the random matrix

$$Y = \sum_k X_k.$$

It is our goal to obtain a tail bound for $\lambda_{\max}(Y)$ that reflects the intrinsic dimension of its variance
$\mathbb{E}(Y^2)$.


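Fix a number $\theta > 0$, and define the function $\psi(t) = e^{\theta t} - \theta t - 1$ for $t \in \mathbb{R}$. The general version
of the matrix Laplace transform bound, Proposition 7.4.1, implies that

$$\begin{aligned}
\mathbb{P}\{\lambda_{\max}(Y) \ge t\} &\le \frac{1}{\psi(t)}\, \mathbb{E} \operatorname{tr} \psi(Y) \\
&= \frac{1}{\psi(t)}\, \mathbb{E} \operatorname{tr}\left(e^{\theta Y} - \theta Y - I\right) \\
&= \frac{1}{e^{\theta t} - \theta t - 1}\, \mathbb{E} \operatorname{tr}\left(e^{\theta Y} - I\right).
\end{aligned} \tag{7.7.1}$$

The last identity holds because the random matrix $Y$ has zero mean.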

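Let us focus on the expected trace on the right-hand side of (7.7.1). Examining the proof of
the original matrix Bernstein bound, Theorem 6.1.1, we recall that

$$\mathbb{E} \operatorname{tr} e^{\theta Y} \le \operatorname{tr} \exp\left(g(\theta) \cdot \mathbb{E}(Y^2)\right) \quad\text{where}\quad g(\theta) = \frac{\theta^2/2}{1 - R\theta/3}.$$

Applying this inequality and introducing the function $\varphi(a) = e^a - 1$, we obtain

$$\begin{aligned}
\mathbb{E} \operatorname{tr}\left(e^{\theta Y} - I\right) &\le \operatorname{tr}\left(e^{g(\theta) \cdot \mathbb{E}(Y^2)} - I\right) \\
&= \operatorname{tr} \varphi\left(g(\theta) \cdot \mathbb{E}(Y^2)\right) \\
&\le \operatorname{intdim}\left(\mathbb{E}(Y^2)\right) \cdot \varphi\left(g(\theta) \cdot \|\mathbb{E}(Y^2)\|\right).
\end{aligned}$$

The last inequality depends on the intrinsic dimension result, Lemma 7.5.1, and the fact that
the intrinsic dimension does not depend on the scaling factor $g(\theta)$. Identify the dimensional
parameter $d = \operatorname{intdim}(\mathbb{E}(Y^2))$ and the variance parameter $\sigma^2 = \|\mathbb{E}(Y^2)\|$. It follows that

$$\mathbb{E} \operatorname{tr}\left(e^{\theta Y} - I\right) \le d \cdot \varphi\left(g(\theta) \cdot \sigma^2\right) \le d \cdot \exp\left(g(\theta) \cdot \sigma^2\right). \tag{7.7.2}$$

This bound depends on the obvious estimate $\varphi(a) \le e^a$, valid for all $a \in \mathbb{R}$.

Substitute the bound (7.7.2) into the probability estimate (7.7.1) to reach

$$\mathbb{P}\{\lambda_{\max}(Y) \ge t\} \le d \cdot \frac{e^{\theta t}}{e^{\theta t} - \theta t - 1} \cdot e^{-\theta t + g(\theta) \cdot \sigma^2}.$$

This estimate holds for any positive value of $\theta$. Choose $\theta = t/(\sigma^2 + Rt/3)$ to obtain a nice form
for the final exponential. To control the fraction, we remark that

$$\frac{e^a}{e^a - a - 1} = 1 + \frac{1 + a}{e^a - a - 1} \le 1 + \frac{3}{a^2} \quad\text{for all } a > 0.$$

The inequality above follows from the fact

$$\frac{e^a - a - 1}{a^2} - \frac{1 + a}{3} > 0 \quad\text{for all } a \in \mathbb{R}.$$

Indeed, the left-hand side of the latter expression defines a convex function of $a$, whose minimal
value, attained near $a \approx 1.30$, is strictly positive.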
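
The numerical fact quoted above is easy to inspect on a grid. The following sketch evaluates the gap $(e^a - a - 1)/a^2 - (1 + a)/3$ on the interval $[-5, 5]$, which is an arbitrary range chosen for illustration, and it reports the grid point where the gap is smallest. The function name `gap` is hypothetical, and a grid search is only a sanity check, not a proof.

```python
import math


def gap(a):
    """(e^a - a - 1) / a^2 - (1 + a) / 3, with the limiting value 1/6 at a = 0."""
    if abs(a) < 1e-4:
        return 1.0 / 6.0   # limit as a -> 0 (1/2 - 1/3); avoids dividing by zero
    return (math.expm1(a) - a) / (a * a) - (1.0 + a) / 3.0


grid = [k / 1000.0 for k in range(-5000, 5001)]   # a in [-5, 5] with step 0.001
a_min = min(grid, key=gap)
print(f"smallest gap on the grid: {gap(a_min):.4f} at a = {a_min:.3f}")
```

On this grid the minimizer lies close to 1.30 and the minimal gap is positive, in agreement with the claim in the text.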

Combine the results from the last paragraph to reach

$$\mathbb{P}\{\lambda_{\max}(Y) \ge t\} \le d \cdot \left(1 + \frac{3(\sigma^2 + Rt/3)^2}{t^4}\right) \cdot \exp\left(\frac{-t^2/2}{\sigma^2 + Rt/3}\right).$$


This probability inequality is typically vacuous when $t^2 < \sigma^2 + Rt/3$, so we may as well limit our
attention to the case where $t^2 \ge \sigma^2 + Rt/3$. Under this assumption, the parenthesis is bounded
by four, which gives the tail bound (7.3.1). We can simplify the restriction on $t$ by solving the
quadratic inequality to obtain the sufficient condition

$$t \ge \frac{1}{2}\left[\frac{R}{3} + \sqrt{\frac{R^2}{9} + 4\sigma^2}\right].$$

We develop an upper bound for the right-hand side of this inequality as follows.

$$\frac{1}{2}\left[\frac{R}{3} + \sqrt{\frac{R^2}{9} + 4\sigma^2}\right] \le \frac{1}{2}\left[\frac{R}{3} + \frac{R}{3} + 2\sigma\right] = \sigma + \frac{R}{3}.$$

We have used the numerical fact $\sqrt{a + b} \le \sqrt{a} + \sqrt{b}$ for all $a, b \ge 0$. Therefore, the tail bound (7.3.1)
is valid when $t \ge \sigma + R/3$. □


7.7.2  The  General  Case 

Finally,  we  present  the  proof  of  the  intrinsic  Bernstein  inequality  for  an  independent  sum  of 
bounded  random  matrices. 

Proof of Corollary 7.3.2. Suppose that $\{S_k\}$ is a finite sequence of independent random matrices
that satisfy

$$\mathbb{E}\,S_k = 0 \quad\text{and}\quad \|S_k\| \le R.$$

Form the sum $Z = \sum_k S_k$. As in the proof of Corollary 6.2.1, we derive the result by applying
Theorem 7.3.1 to the Hermitian dilation $Y = \mathscr{H}(Z)$. The only new point that requires attention
is the definition of the intrinsic dimension of $Z$. From the statement of Theorem 7.3.1, we have

$$d(Y) = \operatorname{intdim}\left(\mathbb{E}(Y^2)\right) = \operatorname{intdim}\left(\mathbb{E}\left(\mathscr{H}(Z)^2\right)\right) = \operatorname{intdim}\begin{bmatrix} \mathbb{E}(ZZ^*) & 0 \\ 0 & \mathbb{E}(Z^*Z) \end{bmatrix}.$$

The last identity arises from the formula (2.1.12) for the square of the dilation. We determine that
the appropriate definition for the intrinsic dimension parameter of $Z$ is

$$d(Z) = \operatorname{intdim}\begin{bmatrix} \mathbb{E}(ZZ^*) & 0 \\ 0 & \mathbb{E}(Z^*Z) \end{bmatrix}.$$

This point completes the argument. □
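
The identity for the square of the dilation can be verified by hand on a small example. The sketch below assumes the standard form of the Hermitian dilation, in which $Z$ and $Z^*$ occupy the off-diagonal blocks and the diagonal blocks are zero, and it checks that the square is block diagonal with blocks $ZZ^*$ and $Z^*Z$. The helper names and the $2 \times 3$ test matrix are hypothetical.

```python
def conj_transpose(M):
    return [[M[i][j].conjugate() for i in range(len(M))] for j in range(len(M[0]))]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
            for i in range(len(A))]


def hermitian_dilation(Z):
    """H(Z) = [[0, Z], [Z*, 0]] for a d1 x d2 matrix Z."""
    d1, d2 = len(Z), len(Z[0])
    Zs = conj_transpose(Z)
    top = [[0] * d1 + list(Z[i]) for i in range(d1)]
    bottom = [list(Zs[j]) + [0] * d2 for j in range(d2)]
    return top + bottom


def block_diag(A, B):
    top = [list(row) + [0] * len(B[0]) for row in A]
    bottom = [[0] * len(A[0]) + list(row) for row in B]
    return top + bottom


Z = [[1 + 2j, 0, 3], [0, 1j, -1]]   # hypothetical 2 x 3 complex matrix
H = hermitian_dilation(Z)
H_squared = matmul(H, H)
expected = block_diag(matmul(Z, conj_transpose(Z)), matmul(conj_transpose(Z), Z))

print("H(Z) is Hermitian:           ", H == conj_transpose(H))
print("H(Z)^2 equals diag(ZZ*, Z*Z):", H_squared == expected)
```

The entries are small integers, so the comparison is exact in floating-point arithmetic.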


7.8  Notes 

At  present,  there  are  two  different  ways  to  improve  the  dimensional  factor  that  appears  in  matrix 
concentration  inequalities. 

First,  there  is  a  sequence  of  matrix  concentration  results  where  the  dimensional  parameter 
is bounded by the total rank of the random matrix. The first bound of this type is due to
Rudelson [Rud99]. Oliveira’s results in [Oli10b] also exhibit this reduced dimensional dependence. A
subsequent paper [MZ11] by Magen and Zouzias contains a related argument that gives similar
results.  We  do  not  discuss  this  class  of  bounds  here. 

The idea that the dimensional factor should depend on metric properties of the random
matrix appears in a paper of Hsu, Kakade, and Zhang [HKZ12b]. They obtain a bound that is similar
to  Theorem  7.3.1.  Unfortunately,  their  argument  is  complicated,  and  the  results  it  delivers  are 
less  refined  than  the  ones  given  here. 

Theorem 7.3.1 is essentially due to Stanislav Minsker [Min11]. His approach leads to
somewhat sharper bounds than the approach in the paper of Hsu-Kakade-Zhang, and his method is
easier  to  understand. 

We  present  a  new,  general  approach  that  delivers  intrinsic  dimension  bounds.  The  intrinsic 
Chernoff  bounds  that  emerge  from  our  framework  are  new.  The  proof  of  the  intrinsic  Bernstein 
bound,  Theorem  7.3.1,  can  be  interpreted  as  a  distillation  of  Minsker’s  argument.  Indeed,  many 
of  the  specific  calculations  already  appear  in  Minsker’s  paper.  We  have  obtained  constants  that 
are  marginally  better. 


Matrix  Concentration:  Resources 


This annotated bibliography describes some papers that involve matrix concentration
inequalities. Right now, this presentation is heavily skewed toward theoretical results, rather than
applications of matrix concentration. It favors, unapologetically, the work of the author. Additional
papers  may  be  included  at  a  later  time. 

Exponential  Matrix  Concentration  Inequalities 

We  begin  with  papers  that  contain  the  most  current  results  on  matrix  concentration. 

•  [Tro11d]. These lecture notes are based heavily on the research described in this paper.
This work identifies Lieb’s Theorem [Lie73, Thm. 6] as the key result that animates
exponential moment bounds for random matrices. Using this technique, the paper develops
the bounds for matrix Gaussian and Rademacher series, the matrix Chernoff inequalities,
and several versions of the matrix Bernstein inequality. In addition, it contains a matrix
Hoeffding inequality (for sums of bounded random matrices), a matrix Azuma inequality
(for matrix martingales with bounded differences), and a matrix bounded difference
inequality (for matrix-valued functions of independent random variables).

•  [Tro12]. This note describes a simple proof of Lieb’s Theorem that is based on the joint
convexity of quantum relative entropy. This reduction, however, still involves a deep convexity
theorem.

•  [Oli10a]. Oliveira’s paper uses an ingenious argument, based on the Golden-Thompson
inequality (3.3.3), to establish a matrix version of Freedman’s inequality. This result is,
roughly, a martingale version of Bernstein’s inequality. This approach has the advantage
that it extends to the fully noncommutative setting [JZ12]. Oliveira applies his results to
study some problems in random graph theory.

•  [Tro11a]. This paper shows that Lieb’s Theorem leads to a Freedman-type inequality for
matrix-valued martingales. The associated technical report [Tro11c] describes additional
results for matrix-valued martingales.

•  [GT11]. This article explains how to use the Lieb-Seiringer Theorem [LS05] to develop tail
bounds for the interior eigenvalues of a sum of independent random matrices. It contains
a Chernoff-type bound for a sum of positive-semidefinite matrices, as well as several
Bernstein-type bounds for sums of bounded random matrices.

•  [MJC+12]. This paper contains a strikingly different method for establishing matrix
concentration inequalities. The argument is based on work of Sourav Chatterjee [Cha07] that
shows how Stein’s method of exchangeable pairs [Ste72] leads to probability inequalities.
This technique has two main advantages. First, it gives results for random matrices that are
based on dependent random variables. In particular, the results apply to sums of
independent random matrices. Second, it delivers both exponential moment bounds and
polynomial moment bounds for random matrices. Indeed, the paper describes a Bernstein-type
exponential inequality and also a Rosenthal-type polynomial moment bound. Furthermore,
this work contains what is arguably the simplest known proof of the noncommutative
Khintchine inequality.

•  [CGT12a, CGT12b]. The primary focus of this paper is to analyze a specific type of
procedure for covariance estimation. The appendix contains a new matrix moment inequality
that is, roughly, the polynomial moment bound associated with the matrix Bernstein
inequality.

•  [Kol11]. These lecture notes use matrix concentration inequalities as a tool to study some
estimation  problems  in  statistics.  They  also  contain  some  matrix  Bernstein  inequalities  for 
unbounded  random  matrices. 

•  [GN]. Gross and Nesme show how to extend Hoeffding’s method for analyzing sampling
without  replacement  to  the  matrix  setting.  This  result  can  be  combined  with  a  variety  of 
matrix  concentration  inequalities. 

•  [Tro11e]. This paper combines the matrix Chernoff inequality, Theorem 5.1.1, with the
argument from [GN] to obtain a matrix Chernoff bound for a sum of random
positive-semidefinite matrices sampled without replacement from a fixed collection. The result is
applied to a random matrix that plays a role in numerical linear algebra.

Bounds  with  Intrinsic  Dimension  Parameters 

The following works contain matrix concentration bounds that depend on a dimension
parameter that may be smaller than the ambient dimension of the matrix.

•  [Oli10b]. Oliveira shows how to develop a version of Rudelson’s inequality [Rud99] using
a variant of the Ahlswede-Winter argument [AW02]. This paper is notable because the
dimensional  factor  is  controlled  by  the  maximum  rank  of  the  random  matrix,  rather  than 
the  ambient  dimension. 

•  [MZ11]. This work contains a matrix Chernoff bound for a sum of independent
positive-semidefinite random matrices where the dimensional dependence is controlled by the
maximum rank of the random matrix. The approach is, essentially, the same as the
argument in Rudelson’s paper. The paper applies these results to study randomized matrix
multiplication algorithms.

•  [HKZ12b]. This paper describes a method for proving matrix concentration inequalities
where the ambient dimension is replaced by the intrinsic dimension of the matrix variance.
The argument is based on an adaptation of the proof in [Tro11a]. The authors give
several examples in statistics and machine learning.

•  [Min11]. This work presents a more refined technique for obtaining matrix concentration
inequalities that depend on the intrinsic dimension, rather than the ambient dimension.

The  Ahlswede-Winter  Method 

In this section, we list some papers that use the ideas from the Ahlswede-Winter paper [AW02] to
obtain  matrix  concentration  inequalities.  In  general,  these  results  have  suboptimal  parameters, 
but  they  played  an  important  role  in  the  development  of  this  field. 

•  [AW02]. The original paper of Ahlswede and Winter describes the matrix Laplace transform
method, along with a number of other fundamental results. They show how to use
the Golden-Thompson inequality to bound the trace of the matrix mgf, and they use this
technique to prove a matrix Chernoff inequality for sums of independent and identically
distributed random variables. Their main application concerns quantum information theory.

•  [CM08]. Christofides and Markström develop a Hoeffding-type inequality for sums of
bounded random matrices using the Ahlswede-Winter argument. They apply this result
to study random graphs.

•  [Gro11]. Gross presents a matrix Bernstein inequality based on the Ahlswede-Winter method,
and he uses it to study algorithms for matrix completion.

•  [Rec11]. Recht describes a different version of the matrix Bernstein inequality, which also
follows from the Ahlswede-Winter technique. His paper also concerns algorithms for matrix
completion.

Noncommutative  Moment  Inequalities 

We  conclude  with  an  overview  of  some  major  works  on  bounds  for  the  polynomial  moments 
of  a  noncommutative  martingale.  Sums  of  independent  random  matrices  provide  one  concrete 
example where these results apply. The results in this literature are as strong as, or stronger than,
the  exponential  moment  inequalities  that  we  have  described  in  these  notes.  Unfortunately,  the 
proofs  are  typically  quite  abstract  and  difficult,  and  they  do  not  usually  lead  to  explicit  constants. 
Recently  there  has  been  some  cross-fertilization  between  noncommutative  probability  and  the 
field  of  matrix  concentration  inequalities. 

Note that “noncommutative” is not synonymous with “matrix” in that there are noncommutative
von Neumann algebras much stranger than the familiar algebra of finite-dimensional
matrices equipped with the operator norm.

•  [TJ74] .  This  classic  paper  gives  a  bound  for  the  expected  trace  of  an  even  power  of  a  matrix 
Rademacher  series.  These  results  are  important,  but  they  do  not  give  the  optimal  bounds. 

•  [LP86].  This  paper  gives  the  first  noncommutative  Khintchine  inequality,  a  bound  for  the 
expected  trace  of  an  even  power  of  a  matrix  Rademacher  series  that  depends  on  the  matrix 
variance. 

•  [LPP91]. This work establishes an optimal version of the noncommutative Khintchine
inequality.

•  [Buc01, Buc05]. These papers prove optimal noncommutative Khintchine inequalities in
more general settings.

•  [JX03, JX08]. These papers establish noncommutative versions of the Burkholder-Davis-Gundy
inequality for martingales. They also give an application of these results to random
matrix theory.

•  [JX05] .  This  paper  contains  an  overview  of  noncommutative  moment  results,  along  with 
information  about  the  optimal  rate  of  growth  in  the  constants. 

•  [JZ11]. This paper describes a fully noncommutative version of the Bennett inequality. The
proof is based on the Ahlswede-Winter method [AW02].

•  [JZ12]. This work shows how to use Oliveira’s argument [Oli10a] to obtain some results for
fully noncommutative martingales.

•  [MJC+12]. This work, described above, includes a section on matrix moment inequalities.
This  paper  contains  what  are  probably  the  simplest  available  proofs  of  these  results. 
Random Matrices in Statistics


**  Covariance  estimation  for  the  multivariate  normal  distribution 


3.  Multi-variate  Distribution.  Use  of  Quadratic  co-ordinates. 

A  comparison  of  equation  (8)  with  the  corresponding  results  (1)  and  (2)  for 
uni-variate  and  bi-variate  sampling,  respectively,  indicates  the  form  the  general 
result  may  be  expected  to  take.  In  fact,  we  have  for  the  simultaneous  distribution 
in  random  samples  of  the  n  variances  (squared  standard  deviations)  and  the 

$n(n-1)/2$ product moment coefficients the following expression:

[displayed formula not reproduced]

$A$ being the determinant $|\rho_{pq}|$, $p, q = 1, 2, 3, \dots, n$,
and $A_{pq}$ the minor of $\rho_{pq}$ in $A$.


John  Wishart 


[Refs] Wishart, Biometrika 1928. Photo from apprendre-math.info.
Random Matrices in Numerical Linear Algebra


**  Model  for  floating-point  errors  in  LU  decomposition 


John  von  Neumann 


now combining (8.6) and (8.7) we obtain our desired result:

(8.8)  Prob $(\lambda > 2\sigma^2 r n) <$ [...]


We  sum  up  in  the  following  theorem: 

(8.9)  The probability that the upper bound $|A|$ of the matrix $A$
of (8.1) exceeds $2.72\,\sigma n^{1/2}$ is less than $.027 \times 2^{-n} n^{-1/2}$, that is, with
probability greater than 99% the upper bound of $A$ is less than
$2.72\,\sigma n^{1/2}$ for $n = 2, 3, \dots$.

This follows at once by taking $r = 3.70$.


[Refs]  von  Neumann  and  Goldstine,  Bull.  AMS  1947  and  Proc.  AMS  1951.  Photo  ©IAS  Archive. 
Random Matrices in Nuclear Physics


**  Model  for  the  Hamiltonian  of  a  heavy  atom  in  a  slow  nuclear  reaction 


Eugene  Wigner 


Random  sign  symmetric  matrix 

The matrices to be considered are $2N + 1$ dimensional real symmetric matrices;
$N$ is a very large number. The diagonal elements of these matrices are zero,
the non diagonal elements $v_{ik} = v_{ki} = \pm v$ have all the same absolute value but
random signs. There are $\mathfrak{N} = 2^{N(2N+1)}$ such matrices. We shall calculate, after
an introductory remark, the averages of $(H^{\nu})_{00}$ and hence the strength function
$S'(x) = \sigma(x)$. This has, in the present case, a second interpretation: it also
gives  the  density  of  the  characteristic  values  of  these  matrices.  This  will  be 
shown  first. 


[Refs]  Wigner,  Ann.  Math  1955.  Photo  from  Nobel  Foundation. 
Randomized Linear Algebra


Input: An $m \times n$ matrix $A$, a target rank $k$, an oversampling parameter $p$
Output: An $m \times (k + p)$ matrix $Q$ with orthonormal columns

1. Draw an $n \times (k + p)$ random matrix $\Omega$

2. Form the matrix product $Y = A\Omega$

3. Construct an orthonormal basis $Q$ for the range of $Y$


[Ref]  Halko-Martinsson-T,  SIAM  Rev.  2011. 
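
The three steps above can be sketched in a few lines. This is a minimal illustration, not the reference implementation from the cited paper: it assumes a standard Gaussian test matrix, stores real matrices as lists of rows, and uses a two-pass Gram-Schmidt routine for step 3. The helper names (`matmul`, `orthonormal_basis`, `randomized_range_finder`) are hypothetical. Unlike the slide, this sketch drops numerically dependent columns, so `Q` may have fewer than k + p columns when `A` has low rank.

```python
import math
import random


def matmul(A, B):
    """Multiply two real matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def orthonormal_basis(Y, rel_tol=1e-10):
    """Orthonormalize the columns of Y with two-pass Gram-Schmidt.

    Columns that are numerically dependent on earlier columns are dropped,
    so the result can have fewer columns than Y.
    """
    basis = []
    for col in zip(*Y):
        v = list(col)
        norm0 = math.sqrt(sum(x * x for x in v))
        for _ in range(2):  # the second pass repairs rounding errors
            for q in basis:
                coeff = sum(qi * vi for qi, vi in zip(q, v))
                v = [vi - coeff * qi for qi, vi in zip(q, v)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > rel_tol * norm0:
            basis.append([x / norm for x in v])
    return transpose(basis)  # the columns of the result are the basis vectors


def randomized_range_finder(A, k, p, rng):
    n = len(A[0])
    # Step 1: draw an n x (k + p) Gaussian test matrix (an assumed choice).
    omega = [[rng.gauss(0.0, 1.0) for _ in range(k + p)] for _ in range(n)]
    # Step 2: form the sample matrix Y = A * Omega.
    Y = matmul(A, omega)
    # Step 3: orthonormalize the columns of Y.
    return orthonormal_basis(Y)
```

For a matrix of exact rank k, the projection $QQ^*A$ should reproduce $A$ up to rounding error, which gives a quick sanity check. In practice a numerical library's QR factorization would replace the hand-written Gram-Schmidt step.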
Other Algorithmic Applications


**  Sparsification.  Accelerate  spectral  calculation  by  randomly  zeroing 
entries  in  a  matrix. 

**  Subsampling.  Accelerate  construction  of  kernels  by  randomly 
subsampling  data. 

**  Dimension  Reduction.  Accelerate  nearest  neighbor  calculations  by 
random  projection  to  a  lower  dimension. 

**  Relaxation  &  Rounding.  Approximate  solution  of  maximization 
problems  with  matrix  variables. 


[Refs]  Achlioptas-McSherry  2001  and  2007,  Spielman-Teng  2004;  Williams-Seeger  2001,  Drineas-Mahoney 
2006,  Gittens  2011;  Indyk-Motwani  1998,  Ailon-Chazelle  2006;  Nemirovski  2007,  So  2009... 
Random Matrices as Models


**  High-Dimensional  Data  Analysis.  Random  matrices  are  used  to 
model  multivariate  data. 

**  Wireless  Communications.  Random  matrices  serve  as  models  for 
wireless  channels. 

**  Demixing  Signals.  Random  model  for  incoherence  when  separating 
two  structured  signals. 


[Refs] Bühlmann and van de Geer 2011, Koltchinskii 2011; Tulino-Verdú 2004; McCoy-T 2011.
Theoretical Applications


**  Algorithms.  Smoothed  analysis  of  Gaussian  elimination. 

**  Combinatorics.  Random  constructions  of  expander  graphs. 

**  High-Dimensional  Geometry.  Structure  of  random  slices  of  convex 
bodies. 

**  Quantum  Information  Theory.  (Counter)examples  to  conjectures 
about  quantum  channel  capacity. 


[Refs]  Sankar-Spielman-Teng  2006;  Pinsker  1973;  Gordon  1985;  Hayden-Winter  2008,  Hastings  2009. 
The Conventional Wisdom


“Random Matrices are Tough!”


[Refs] youtube.com/watch?v=NOOcvqTltAE, most monographs on RMT.
Principle A


“But...

In many applications, a random matrix can
be decomposed as a sum of independent
random matrices:

$$Z = \sum_{k=1}^{n} S_k$$
Principle B


and


There are exponential concentration
inequalities for the spectral norm of a sum
of independent random matrices:

$$\mathbb{P}\{ \|Z\| \ge t \} \le \exp(\cdots)$$ ”
Matrix Gaussian Series
The Norm of a Matrix Gaussian Series


Theorem 1. [Oliveira 2010, T 2010] Suppose

** $B_1, B_2, B_3, \dots$ are fixed matrices with dimension $d_1 \times d_2$, and
** $\gamma_1, \gamma_2, \gamma_3, \dots$ are independent standard normal RVs.

Define $d := d_1 + d_2$ and the variance parameter

$$\sigma^2 := \max\Big\{ \Big\| \sum_k B_k B_k^* \Big\|, \ \Big\| \sum_k B_k^* B_k \Big\| \Big\}.$$

Then

$$\mathbb{P}\Big\{ \Big\| \sum_k \gamma_k B_k \Big\| \ge t \Big\} \le d \cdot e^{-t^2/2\sigma^2}.$$


[Refs] Tomczak-Jaegermann 1974, Lust-Piquard 1986, Lust-Piquard-Pisier 1991, Rudelson 1999,
Buchholz 2001 and 2005, Oliveira 2010, T 2011. Notes: Cor. 4.2.1, page 33.
The Norm of a Matrix Gaussian Series


Theorem 2. [Oliveira 2010, T 2010] Suppose

** $B_1, B_2, B_3, \dots$ are fixed matrices with dimension $d_1 \times d_2$, and
** $\gamma_1, \gamma_2, \gamma_3, \dots$ are independent standard normal RVs.

Define $d := d_1 + d_2$ and the variance parameter

$$\sigma^2 := \max\Big\{ \Big\| \sum_k B_k B_k^* \Big\|, \ \Big\| \sum_k B_k^* B_k \Big\| \Big\}.$$

Then

$$\mathbb{E}\, \Big\| \sum_k \gamma_k B_k \Big\| \le \sqrt{2\sigma^2 \log d}.$$


[Refs] Tomczak-Jaegermann 1974, Lust-Piquard 1986, Lust-Piquard-Pisier 1991, Rudelson 1999,
Buchholz 2001 and 2005, Oliveira 2010, T 2011. Notes: Cor. 4.2.1, page 33.
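
To get a feel for the two Gaussian series bounds, the following sketch evaluates the tail bound $d \cdot e^{-t^2/2\sigma^2}$ and the expectation bound $\sqrt{2\sigma^2 \log d}$ for given parameters. The function names are hypothetical, and capping the tail bound at 1 is an addition for readability (a probability bound above 1 carries no information); it is not part of the theorem.

```python
import math


def gaussian_series_tail_bound(t, sigma2, d1, d2):
    """Bound on P{ ||sum_k gamma_k B_k|| >= t }, capped at 1 (the cap is an addition)."""
    d = d1 + d2
    return min(1.0, d * math.exp(-t * t / (2.0 * sigma2)))


def gaussian_series_expected_norm_bound(sigma2, d1, d2):
    """Bound on E ||sum_k gamma_k B_k||, namely sqrt(2 * sigma^2 * log d)."""
    return math.sqrt(2.0 * sigma2 * math.log(d1 + d2))
```

The two bounds fit together: at the level $t = \sqrt{2\sigma^2 \log d}$ the uncapped tail bound equals exactly 1, so the tail bound only becomes informative for $t$ above the expectation bound.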
The Variance Parameter


** Define the matrix Gaussian series $Z = \sum_k \gamma_k B_k$

** The variance parameter $\sigma^2(Z)$ derives from the “mean square of $Z$”
** But a general matrix has two different squares!

$$\mathbb{E}(ZZ^*) = \sum_{j=1}^{n} \sum_{k=1}^{n} \mathbb{E}(\gamma_j \gamma_k)\, B_j B_k^* = \sum_{k=1}^{n} B_k B_k^*$$

$$\mathbb{E}(Z^*Z) = \sum_{j=1}^{n} \sum_{k=1}^{n} \mathbb{E}(\gamma_j \gamma_k)\, B_j^* B_k = \sum_{k=1}^{n} B_k^* B_k$$

** Variance parameter $\sigma^2(Z) = \max\{ \|\mathbb{E}(ZZ^*)\|, \ \|\mathbb{E}(Z^*Z)\| \}$.
Schematic of Gaussian Series Tail Bound
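Warmup: A Wigner Matrix


** Let $\{\gamma_{jk} : 1 \le j < k \le n\}$ be independent standard normal variables

** A Gaussian Wigner matrix:

$$W = \begin{bmatrix}
0 & \gamma_{12} & \gamma_{13} & \cdots & \gamma_{1n} \\
\gamma_{12} & 0 & \gamma_{23} & \cdots & \gamma_{2n} \\
\gamma_{13} & \gamma_{23} & 0 & \ddots & \vdots \\
\vdots & \vdots & \ddots & \ddots & \gamma_{n-1,n} \\
\gamma_{1n} & \gamma_{2n} & \cdots & \gamma_{n-1,n} & 0
\end{bmatrix}$$

** Problem: What is $\mathbb{E}\,\|W\|$?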


Notes:  §4.4.1,  page  35. 
The Wigner Matrix, qua Gaussian Series


** Express the Wigner matrix as a Gaussian series:

$$W = \sum_{1 \le j < k \le n} \gamma_{jk} (E_{jk} + E_{kj})$$

** The symbol $E_{jk}$ denotes the $n \times n$ matrix unit, which has a one in row $j$, column $k$ and zeros elsewhere
Norm Bound for the Wigner Matrix


** Need to compute the variance parameter $\sigma^2(W)$

** Summands are symmetric, so both matrix squares are the same:

$$\sum_{1 \le j < k \le n} (E_{jk} + E_{kj})^2
= \sum_{1 \le j < k \le n} (E_{jk}E_{jk} + E_{jk}E_{kj} + E_{kj}E_{jk} + E_{kj}E_{kj})
= \sum_{1 \le j < k \le n} (0 + E_{jj} + E_{kk} + 0) = (n - 1)\, I_n$$

** Thus, the variance $\sigma^2(W) = \|(n-1)\, I_n\| = n - 1$

** Conclusion: $\mathbb{E}\,\|W\| \le \sqrt{2(n-1)\log(2n)}$

** Optimal: $\mathbb{E}\,\|W\| \sim 2\sqrt{n}$


[Refs]  Wigner  1955,  Davidson-Szarek  2002,  Tao  2012. 
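
The variance calculation for the Wigner matrix can be checked by brute force. The sketch below (hypothetical helper names, exact integer arithmetic, 0-based indices) sums the squares $(E_{jk} + E_{kj})^2$ over all pairs $j < k$ and evaluates the resulting norm bound $\sqrt{2(n-1)\log(2n)}$.

```python
import math


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def wigner_sum_of_squares(n):
    """Return the sum over j < k of (E_jk + E_kj)^2 as an n x n integer matrix."""
    total = [[0] * n for _ in range(n)]
    for j in range(n):
        for k in range(j + 1, n):
            # Coefficient matrix of gamma_jk: ones in positions (j, k) and (k, j).
            A = [[0] * n for _ in range(n)]
            A[j][k] = 1
            A[k][j] = 1
            square = matmul(A, A)
            total = [[t + s for t, s in zip(tr, sr)] for tr, sr in zip(total, square)]
    return total


def wigner_norm_bound(n):
    """Upper bound sqrt(2 (n - 1) log(2n)) on the expected spectral norm."""
    return math.sqrt(2.0 * (n - 1) * math.log(2 * n))
```

Each index appears in exactly $n - 1$ pairs, which is why the sum collapses to $(n-1) I_n$. Comparing `wigner_norm_bound(n)` with $2\sqrt{n}$ for growing `n` shows the extra $\sqrt{\log n}$ factor that the matrix Gaussian series bound pays relative to the optimal rate.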
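Example: A Gaussian Toeplitz Matrix


** Let $\{\gamma_k\}$ be independent standard normal variables

** An unsymmetric Gaussian Toeplitz matrix:

$$T = \begin{bmatrix}
\gamma_0 & \gamma_1 & \cdots & & \gamma_{n-1} \\
\gamma_{-1} & \gamma_0 & \gamma_1 & & \vdots \\
\vdots & \gamma_{-1} & \gamma_0 & \ddots & \\
 & & \ddots & \ddots & \gamma_1 \\
\gamma_{-(n-1)} & \cdots & & \gamma_{-1} & \gamma_0
\end{bmatrix}$$

** Problem: What is $\mathbb{E}\,\|T\|$?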


Notes:  §4.6,  page  38. 


Joel  A.  Tropp,  User-Friendly  Tools  for  Random  Matrices,  NIPS,  3  December  2012 



The  Toeplitz  Matrix,  qua  Gaussian  Series 


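** Express the unsymmetric Toeplitz matrix as a Gaussian series:

$$T = \gamma_0 I + \sum_{k=1}^{n-1} \gamma_k S^k + \sum_{k=1}^{n-1} \gamma_{-k} (S^k)^*$$

** The matrix $S$ is the shift-up operator on $n$-dimensional column vectors:

$$S = \begin{bmatrix}
0 & 1 & & & \\
 & 0 & 1 & & \\
 & & \ddots & \ddots & \\
 & & & 0 & 1 \\
 & & & & 0
\end{bmatrix}$$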
Variance Calculation for the Toeplitz Matrix


** Note that

$$(S^k)(S^k)^* = \sum_{j=1}^{n-k} E_{jj} \quad\text{and}\quad (S^k)^*(S^k) = \sum_{j=k+1}^{n} E_{jj}.$$

** Both sums of squares take the form

$$I^2 + \sum_{k=1}^{n-1} (S^k)(S^k)^* + \sum_{k=1}^{n-1} (S^k)^*(S^k)
= I + \sum_{k=1}^{n-1} \Big[ \sum_{j=1}^{n-k} E_{jj} + \sum_{j=k+1}^{n} E_{jj} \Big]
= \sum_{j=1}^{n} \Big[ 1 + \sum_{k=1}^{n-j} 1 + \sum_{k=1}^{j-1} 1 \Big] E_{jj}$$

$$= \sum_{j=1}^{n} \big( 1 + (n - j) + (j - 1) \big) E_{jj} = n\, I$$
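
The identity above is easy to confirm numerically. The following sketch (hypothetical helper names, exact integer arithmetic) builds the powers of the shift-up operator and accumulates $I + \sum_k (S^k)(S^k)^* + \sum_k (S^k)^*(S^k)$.

```python
def shift_power(n, k):
    """S^k for the n x n shift-up operator S: ones on the k-th superdiagonal."""
    return [[1 if c == r + k else 0 for c in range(n)] for r in range(n)]


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def toeplitz_sum_of_squares(n):
    """I + sum_k S^k (S^k)^* + sum_k (S^k)^* S^k for k = 1, ..., n - 1."""
    total = shift_power(n, 0)  # the gamma_0 term contributes I^2 = I
    for k in range(1, n):
        Sk = shift_power(n, k)
        total = add(total, matmul(Sk, transpose(Sk)))  # term from gamma_k
        total = add(total, matmul(transpose(Sk), Sk))  # term from gamma_{-k}
    return total
```

Calling `toeplitz_sum_of_squares(n)` should return $n$ times the identity, which gives the variance parameter $\sigma^2(T) = n$ used in the norm bound.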
Norm Bound for the Toeplitz Matrix


** The variance parameter $\sigma^2(T) = \|n\, I_n\| = n$

** Conclusion: $\mathbb{E}\,\|T\| \le \sqrt{2n\log(2n)}$

** Optimal: $\mathbb{E}\,\|T\| \sim \mathrm{const} \cdot \sqrt{2n \log n}$

** The optimal constant is at least 0.8288...

[Refs]  Bryc-Dembo-Jiang  2006,  Meckes  2007,  Sen-Virag  2011,  T  2011 
Matrix Chernoff Inequality
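The Matrix Chernoff Bound


Theorem 3. [T 2010] Suppose

** $X_1, X_2, X_3, \dots$ are independent random psd matrices with dimension $d$, and
** $\lambda_{\max}(X_k) \le R$ for each $k$.

Then

$$\mathbb{P}\Big\{ \lambda_{\min}\Big( \sum_k X_k \Big) \le (1 - t) \cdot \mu_{\min} \Big\} \le d \cdot \Big[ \frac{e^{-t}}{(1-t)^{1-t}} \Big]^{\mu_{\min}/R}$$

$$\mathbb{P}\Big\{ \lambda_{\max}\Big( \sum_k X_k \Big) \ge (1 + t) \cdot \mu_{\max} \Big\} \le d \cdot \Big[ \frac{e^{t}}{(1+t)^{1+t}} \Big]^{\mu_{\max}/R}$$

where $\mu_{\min} := \lambda_{\min}\big( \sum_k \mathbb{E} X_k \big)$ and $\mu_{\max} := \lambda_{\max}\big( \sum_k \mathbb{E} X_k \big)$.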


[Refs]  Ahlswede-Winter  2002,  T  2011.  Notes:  Thm.  5.1.1,  page  48. 
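The Matrix Chernoff Bound


Theorem 4. [T 2010] Suppose

** $X_1, X_2, X_3, \dots$ are independent random psd matrices with dimension $d$, and
** $\lambda_{\max}(X_k) \le R$ for each $k$.

Then

$$\mathbb{E}\, \lambda_{\min}\Big( \sum_k X_k \Big) \ge 0.6\, \mu_{\min} - R \log d$$

$$\mathbb{E}\, \lambda_{\max}\Big( \sum_k X_k \Big) \le 1.8\, \mu_{\max} + R \log d$$

where $\mu_{\min} := \lambda_{\min}\big( \sum_k \mathbb{E} X_k \big)$ and $\mu_{\max} := \lambda_{\max}\big( \sum_k \mathbb{E} X_k \big)$.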

[Refs]  Ahlswede-Winter  2002,  T  2011.  Notes:  Thm.  5.1.1,  page  48. 
Example: Random Submatrices


Fixed matrix, in captivity:

$$C = \begin{bmatrix} c_1 & c_2 & c_3 & c_4 & \cdots & c_n \end{bmatrix}_{d \times n}$$

Random matrix, formed by picking random columns:

$$Z = \begin{bmatrix} & c_2 & c_3 & & \cdots & c_n \end{bmatrix}_{d \times n}$$

Problem: What is the expectation of $\sigma_1(Z)$? What about $\sigma_d(Z)$?

Notes:  §5.2.1,  page  49. 
Model for Random Submatrix


** Let $C$ be a fixed $d \times n$ matrix with columns $c_1, \dots, c_n$

** Let $\delta_1, \dots, \delta_n$ be independent 0-1 random variables with mean $s/n$

** Define $\Delta = \mathrm{diag}(\delta_1, \dots, \delta_n)$

** Form a random submatrix $Z$ by turning off columns from $C$

$$Z = C\Delta = \begin{bmatrix} c_1 & c_2 & \cdots & c_n \end{bmatrix}_{d \times n}
\begin{bmatrix} \delta_1 & & & \\ & \delta_2 & & \\ & & \ddots & \\ & & & \delta_n \end{bmatrix}_{n \times n}$$

** Note that $Z$ typically contains about $s$ nonzero columns
The Random Submatrix, qua PSD Sum


** The largest and smallest singular values of $Z$ satisfy

$$\sigma_1(Z)^2 = \lambda_{\max}(ZZ^*)$$

$$\sigma_d(Z)^2 = \lambda_{\min}(ZZ^*)$$

** Define the psd matrix $Y = ZZ^*$, and observe that

$$Y = ZZ^* = C\Delta^2 C^* = C\Delta C^* = \sum_{k=1}^{n} \delta_k\, c_k c_k^*$$

** We have expressed $Y$ as a sum of independent psd random matrices
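
The identity $ZZ^* = \sum_k \delta_k c_k c_k^*$ can be illustrated with a small simulation. This sketch assumes a real matrix stored as a list of rows, so the adjoint is the transpose; the function name is hypothetical. It draws the 0-1 selectors, forms $Z = C\Delta$, and computes $Y$ both ways.

```python
import random


def random_submatrix_psd_sum(C, s, rng):
    """Draw Z = C * Delta and return (delta, Z, Z Z^T, sum_k delta_k c_k c_k^T)."""
    d, n = len(C), len(C[0])
    # Independent 0-1 selectors, each with mean s / n.
    delta = [1 if rng.random() < s / n else 0 for _ in range(n)]
    # Turn off the columns of C whose selector is zero.
    Z = [[C[i][k] * delta[k] for k in range(n)] for i in range(d)]
    # Y computed directly as Z Z^T.
    y_direct = [[sum(Z[i][k] * Z[j][k] for k in range(n)) for j in range(d)]
                for i in range(d)]
    # Y computed as a sum of rank-one psd terms delta_k * c_k c_k^T.
    y_sum = [[sum(delta[k] * C[i][k] * C[j][k] for k in range(n)) for j in range(d)]
             for i in range(d)]
    return delta, Z, y_direct, y_sum
```

The two computations agree because $\delta_k^2 = \delta_k$ for a 0-1 variable, which is the step $C\Delta^2 C^* = C\Delta C^*$ on the slide.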
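Preparing to Apply the Chernoff Bound


** Consider the random matrix

$$Y = \sum_{k=1}^{n} \delta_k\, c_k c_k^*$$

** The maximal eigenvalue of each summand is bounded as

$$\lambda_{\max}(\delta_k\, c_k c_k^*) \le \max_k \|c_k\|^2 =: R$$

** The expectation of the random matrix $Y$ is

$$\mathbb{E}(Y) = \frac{s}{n} \sum_{k=1}^{n} c_k c_k^* = \frac{s}{n}\, CC^*$$

** The mean parameters satisfy

$$\mu_{\max} = \lambda_{\max}(\mathbb{E}\, Y) = \frac{s}{n}\, \sigma_1(C)^2 \quad\text{and}\quad \mu_{\min} = \lambda_{\min}(\mathbb{E}\, Y) = \frac{s}{n}\, \sigma_d(C)^2$$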
What the Chernoff Bound Says


Applying the Chernoff bound, we reach

$$\mathbb{E}\big[ \sigma_1(Z)^2 \big] = \mathbb{E}\, \lambda_{\max}(Y) \le 1.8 \cdot \frac{s}{n}\, \sigma_1(C)^2 + \max_k \|c_k\|^2 \cdot \log d$$

$$\mathbb{E}\big[ \sigma_d(Z)^2 \big] = \mathbb{E}\, \lambda_{\min}(Y) \ge 0.6 \cdot \frac{s}{n}\, \sigma_d(C)^2 - \max_k \|c_k\|^2 \cdot \log d$$

** Matrix $C$ has $n$ columns; the random submatrix $Z$ includes about $s$
** The singular value $\sigma_i(Z)^2$ inherits an $s/n$ share of $\sigma_i(C)^2$ for $i = 1, d$
** Additive correction reflects number $d$ of rows of $C$, max column norm

** [Gittens-T 2011] Remaining singular values have similar behavior
Key Example: Unit-Norm Tight Frame


** A $d \times n$ unit-norm tight frame $C$ satisfies

$$CC^* = \frac{n}{d}\, I_d \quad\text{and}\quad \|c_k\|^2 = 1 \ \text{ for } k = 1, 2, \dots, n$$

** Specializing the inequalities from the previous slide:

$$\mathbb{E}\big[ \sigma_1(Z)^2 \big] \le 1.8 \cdot \frac{s}{d} + \log d$$

$$\mathbb{E}\big[ \sigma_d(Z)^2 \big] \ge 0.6 \cdot \frac{s}{d} - \log d$$

** Choose $s \ge 1.67\, d \log d$ columns for a nontrivial lower bound
** Sharp condition $s > d \log d$ also follows from matrix Chernoff bound

[Refs]  Rudelson  1999,  Rudelson-Vershynin  2007,  T  2008,  Gittens-T  2011,  T  2011,  Chretien-Darses  2012. 
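
As a worked illustration of the tight frame bounds, the sketch below evaluates the two expectation bounds for a given sampling level $s$ and dimension $d$, and finds the smallest integer $s$ at which the lower bound $0.6 \cdot s/d - \log d$ becomes positive. The function names are hypothetical; the constants 0.6 and 1.8 are the ones quoted in the theorem.

```python
import math


def tight_frame_bounds(s, d):
    """Return (lower, upper) bounds on E[sigma_d(Z)^2] and E[sigma_1(Z)^2]."""
    lower = 0.6 * s / d - math.log(d)
    upper = 1.8 * s / d + math.log(d)
    return lower, upper


def min_columns_for_nontrivial_lower_bound(d):
    """Smallest integer s with 0.6 * s / d - log(d) > 0."""
    return math.floor(d * math.log(d) / 0.6) + 1
```

Since $1/0.6 \approx 1.67$, the threshold returned by the second function matches the condition $s \ge 1.67\, d \log d$ stated above.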
Matrix Bernstein Inequality
The Matrix Bernstein Inequality


Theorem 5. [Oliveira 2010, T 2010] Suppose

** $S_1, S_2, S_3, \dots$ are indep. random matrices with dimension $d_1 \times d_2$,
** $\mathbb{E}\, S_k = 0$ for each $k$, and
** $\|S_k\| \le R$ for each $k$.

Then

$$\mathbb{P}\Big\{ \Big\| \sum_k S_k \Big\| \ge t \Big\} \le d \cdot \exp\Big( \frac{-t^2/2}{\sigma^2 + Rt/3} \Big)$$

where $d := d_1 + d_2$ and the variance parameter

$$\sigma^2 := \max\Big\{ \Big\| \sum_k \mathbb{E}(S_k S_k^*) \Big\|, \ \Big\| \sum_k \mathbb{E}(S_k^* S_k) \Big\| \Big\}$$



[Refs]  Gross  2010,  Recht  2011,  Oliveira  2010,  T  2011.  Notes:  Cor.  6.2.1,  page  64. 
The Matrix Bernstein Inequality


Theorem 6. [Oliveira 2010, T 2010] Suppose

** $S_1, S_2, S_3, \dots$ are indep. random matrices with dimension $d_1 \times d_2$,
** $\mathbb{E}\, S_k = 0$ for each $k$, and
** $\|S_k\| \le R$ for each $k$.

Then

$$\mathbb{E}\, \Big\| \sum_k S_k \Big\| \le \sqrt{2\sigma^2 \log d} + \tfrac{1}{3} R \log d$$

where $d := d_1 + d_2$ and the variance parameter

$$\sigma^2 := \max\Big\{ \Big\| \sum_k \mathbb{E}(S_k S_k^*) \Big\|, \ \Big\| \sum_k \mathbb{E}(S_k^* S_k) \Big\| \Big\}$$



[Refs]  Gross  2010,  Recht  2011,  Oliveira  2010,  T  2011.  Notes:  Cor.  6.2.1,  page  64. 
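
The two forms of the matrix Bernstein inequality can be evaluated with a short helper. This is a sketch with hypothetical function names; the cap at 1 on the tail bound is an addition for readability, not part of the theorem.

```python
import math


def bernstein_tail_bound(t, sigma2, R, d1, d2):
    """Bound on P{ ||sum_k S_k|| >= t }, capped at 1 (the cap is an addition)."""
    d = d1 + d2
    exponent = -(t * t / 2.0) / (sigma2 + R * t / 3.0)
    return min(1.0, d * math.exp(exponent))


def bernstein_expected_norm_bound(sigma2, R, d1, d2):
    """Bound on E ||sum_k S_k||: sqrt(2 sigma^2 log d) + (1/3) R log d."""
    log_d = math.log(d1 + d2)
    return math.sqrt(2.0 * sigma2 * log_d) + R * log_d / 3.0
```

Setting `R = 0` recovers the Gaussian series bounds, which reflects the two regimes of the inequality: sub-Gaussian decay governed by $\sigma^2$ for moderate $t$, and sub-exponential decay governed by $R$ for large $t$.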
Example: Randomized Matrix Multiplication


Product of two matrices, in captivity:

$$BC^* = \begin{bmatrix} b_1 & b_2 & b_3 & b_4 & \cdots & b_n \end{bmatrix}_{d_1 \times n}
\begin{bmatrix} c_1^* \\ c_2^* \\ c_3^* \\ c_4^* \\ \vdots \\ c_n^* \end{bmatrix}_{n \times d_2}$$

[Idea] Approximate multiplication by random sampling


[Refs] Drineas-Mahoney-Kannan 2004, Magen-Zouzias 2010, Magdon-Ismail 2010, Hsu-Kakade-Zhang
2011 and 2012.
A Sampling Model for Tutorial Purposes


** Assume $\|b_j\| = 1$ and $\|c_j\| = 1$ for $j = 1, 2, \dots, n$

** Construct a random variable $S$ whose value is a $d_1 \times d_2$ matrix:

   ** Draw $J \sim \mathrm{uniform}\{1, 2, \dots, n\}$

   ** Set $S = n \cdot b_J c_J^*$

** The random matrix $S$ is an unbiased estimator of the product $BC^*$

$$\mathbb{E}\, S = \sum_{j=1}^{n} (n \cdot b_j c_j^*)\, \mathbb{P}\{J = j\} = \sum_{j=1}^{n} b_j c_j^* = BC^*$$

** Approximate $BC^*$ by averaging $m$ independent copies of $S$

$$Z = \frac{1}{m} \sum_{k=1}^{m} S_k \approx BC^*$$



Notes:  §6.4,  page  67. 
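
The sampling model translates directly into code. The sketch below assumes real matrices given by their lists of columns, so $c_j^*$ is just the transposed column; the helper names are hypothetical. It draws $S = n \cdot b_J c_J^*$ with $J$ uniform and averages $m$ independent copies.

```python
import random


def outer(b, c):
    """Outer product b c^T of two real vectors."""
    return [[bi * cj for cj in c] for bi in b]


def exact_product(B_cols, C_cols):
    """B C^T = sum_j b_j c_j^T, with both factors given by their columns."""
    d1, d2 = len(B_cols[0]), len(C_cols[0])
    total = [[0.0] * d2 for _ in range(d1)]
    for b, c in zip(B_cols, C_cols):
        term = outer(b, c)
        total = [[t + x for t, x in zip(tr, xr)] for tr, xr in zip(total, term)]
    return total


def sample_estimator(B_cols, C_cols, rng):
    """One draw of S = n * b_J c_J^T with J uniform over the column indices."""
    n = len(B_cols)
    j = rng.randrange(n)
    return [[n * x for x in row] for row in outer(B_cols[j], C_cols[j])]


def averaged_estimate(B_cols, C_cols, m, rng):
    """Z = (1/m) * (S_1 + ... + S_m) for independent copies S_k of S."""
    d1, d2 = len(B_cols[0]), len(C_cols[0])
    total = [[0.0] * d2 for _ in range(d1)]
    for _ in range(m):
        S = sample_estimator(B_cols, C_cols, rng)
        total = [[t + x for t, x in zip(tr, xr)] for tr, xr in zip(total, S)]
    return [[t / m for t in row] for row in total]
```

Uniform sampling is natural here only because all columns are assumed to have unit norm; the slides do not discuss other sampling distributions, and this sketch does not attempt them.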
Preparing to Apply the Bernstein Bound I


** Let $S_k$ be independent copies of $S$, and consider the average

$$Z = \frac{1}{m} \sum_{k=1}^{m} S_k$$

** We study the typical approximation error

$$\mathbb{E}\, \|Z - BC^*\| = \frac{1}{m} \cdot \mathbb{E}\, \Big\| \sum_{k=1}^{m} (S_k - BC^*) \Big\|$$

** The summands are independent and $\mathbb{E}\, S_k = BC^*$, so we symmetrize:

$$\mathbb{E}\, \|Z - BC^*\| \le \frac{2}{m} \cdot \mathbb{E}\, \Big\| \sum_{k=1}^{m} \varepsilon_k S_k \Big\|$$

where $\{\varepsilon_k\}$ are independent Rademacher RVs, independent from $\{S_k\}$
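Preparing to Apply the Bernstein Bound II


** The norm of each summand satisfies the uniform bound

$$R = \|\varepsilon_k S_k\| = \|S\| = \|n \cdot (b_J c_J^*)\| = n\, \|b_J\|\, \|c_J\| = n$$

** Compute the variance in two stages:

$$\mathbb{E}(SS^*) = \sum_{j=1}^{n} n^2 (b_j c_j^*)(b_j c_j^*)^*\, \mathbb{P}\{J = j\} = n \sum_{j=1}^{n} \|c_j\|^2\, b_j b_j^* = n\, BB^*$$

$$\mathbb{E}(S^*S) = n\, CC^*$$

$$\sigma^2 = \max\Big\{ \Big\| \sum_{k=1}^{m} \mathbb{E}(S_k S_k^*) \Big\|, \ \Big\| \sum_{k=1}^{m} \mathbb{E}(S_k^* S_k) \Big\| \Big\}
= \max\big\{ \|mn \cdot BB^*\|, \ \|mn \cdot CC^*\| \big\}
= mn \cdot \max\big\{ \|B\|^2, \ \|C\|^2 \big\}$$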
What the Bernstein Bound Says


Applying the Bernstein bound, we reach

$$\mathbb{E}\, \|Z - BC^*\| \le \frac{2}{m}\, \mathbb{E}\, \Big\| \sum_{k=1}^{m} \varepsilon_k S_k \Big\|
\le \frac{2}{m} \Big[ \sigma \sqrt{2 \log(d_1 + d_2)} + \tfrac{1}{3} R \log(d_1 + d_2) \Big]$$

$$= 2 \sqrt{\frac{2 n \log(d_1 + d_2)}{m}} \cdot \max\{ \|B\|, \|C\| \} + \frac{2}{3} \cdot \frac{n \log(d_1 + d_2)}{m}$$

[Q] What can this possibly mean? Is this bound any good at all?
Detour: The Stable Rank


** The stable rank of a matrix is defined as

$$\mathrm{srank}(A) := \frac{\|A\|_F^2}{\|A\|^2}$$

** In general, $1 \le \mathrm{srank}(A) \le \mathrm{rank}(A)$

** When $A$ has either $n$ rows or $n$ columns, $1 \le \mathrm{srank}(A) \le n$

** Assume that $A$ has $n$ unit-norm columns, so that $\|A\|_F^2 = n$

** When all columns of $A$ are the same, $\|A\|^2 = n$ and $\mathrm{srank}(A) = 1$

** When all columns of $A$ are orthogonal, $\|A\|^2 = 1$ and $\mathrm{srank}(A) = n$
Randomized Matrix Multiply, Relative Error


** Define the (geometric) mean stable rank of the factors to be

$$s := \sqrt{\mathrm{srank}(B) \cdot \mathrm{srank}(C)}.$$

** Converting the error bound to a relative scale, we obtain

$$\frac{\mathbb{E}\, \|Z - BC^*\|}{\|B\|\, \|C\|} \le 2 \sqrt{\frac{2 s \log(d_1 + d_2)}{m}} + \frac{2}{3} \cdot \frac{s \log(d_1 + d_2)}{m}$$

** For relative error $\varepsilon \in (0, 1)$, the number $m$ of samples should be

$$m \ge \mathrm{Const} \cdot \varepsilon^{-2} \cdot s \log(d_1 + d_2)$$

** The number of samples is proportional to the mean stable rank!

** We also pay weakly for the dimension $d_1 \times d_2$ of the product $BC^*$
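
The stable rank and the resulting sample count are simple to compute for small examples. The sketch below estimates $\|A\|^2$ by power iteration on $A^*A$ (an assumed numerical method, chosen to avoid external libraries) for real matrices stored as lists of rows. The constant in the sample bound is not specified on the slides, so it is left as a required argument `const`; all function names are hypothetical.

```python
import math
import random


def spectral_norm_sq(A, iters=500, seed=0):
    """Estimate ||A||^2 = lambda_max(A^T A) by power iteration (a sketch)."""
    rng = random.Random(seed)
    m, n = len(A), len(A[0])
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    lam = 0.0
    for _ in range(iters):
        y = [sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]  # y = A x
        z = [sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]  # z = A^T y
        norm = math.sqrt(sum(v * v for v in z))
        if norm == 0.0:
            return 0.0
        lam = norm  # equals ||A^T A x|| for the current unit vector x
        x = [v / norm for v in z]
    return lam


def stable_rank(A):
    """srank(A) = ||A||_F^2 / ||A||^2."""
    frob_sq = sum(v * v for row in A for v in row)
    return frob_sq / spectral_norm_sq(A)


def mean_stable_rank(B, C):
    return math.sqrt(stable_rank(B) * stable_rank(C))


def samples_needed(eps, s, d1, d2, const):
    """Smallest integer m with m >= const * eps^-2 * s * log(d1 + d2)."""
    return math.ceil(const * s * math.log(d1 + d2) / (eps * eps))
```

The two extreme cases from the stable rank slide make convenient checks: a matrix with $n$ identical unit-norm columns has stable rank 1, while a matrix with $n$ orthonormal columns has stable rank $n$.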
More Things in Heaven & Earth


**  [More  Bounds  for  Eigenvalues]  There  are  exponential  tail  bounds  for  maximum 
eigenvalues,  minimum  eigenvalues,  and  eigenvalues  in  between... 

**  [More  Exponential  Bounds]  There  is  a  matrix  Hoeffding  inequality  and  a  matrix 
Bennett  inequality,  plus  matrix  Chernoff  and  Bernstein  for  unbounded  matrices... 

**  [Matrix  Martingales]  There  is  a  matrix  Azuma  inequality,  a  matrix  bounded 
difference  inequality,  and  a  matrix  Freedman  inequality... 

**  [Dependent  Sums]  Exponential  tail  bounds  hold  for  some  random  matrices  based  on 
dependent  random  variables... 

**  [Polynomial  Bounds]  There  are  matrix  versions  of  the  Rosenthal  inequality,  the 
Pinelis  inequality,  and  the  Burkholder-Davis-Gundy  inequality... 

**  [Intrinsic  Dimension]  The  dimensional  dependence  can  sometimes  be  weakened... 

**  [The  Proofs!]  And  the  technical  arguments  are  amazingly  pretty... 

[Refs]  T  2011,  Gittens-T  2011,  Oliveira  2010,  Mackey  et  al.  2012,  ... 
To learn more...


E-mail: jtropp@cms.caltech.edu

Web: http://users.cms.caltech.edu/~jtropp

Some papers:

** “User-friendly tail bounds for sums of random matrices,” FOCM, 2011.

** “User-friendly tail bounds for matrix martingales,” Caltech ACM Report 2011-01.

** “Freedman’s inequality for matrix martingales,” ECP, 2011.

** “A comparison principle for functions of a uniformly random subspace,” PTRF, 2011.

** “From the joint convexity of relative entropy to a concavity theorem of Lieb,” PAMS, 2012.

** “Improved analysis of the subsampled randomized Hadamard transform,” AADA, 2011.